Tuesday 16 February 2021

Spot The Bug: Python Data Mislocation

Yo, you delightful bug-hunters. I'm pleased to bring to you a whole new edition of Spot The Bug!

Bugs, beware!


Today's subject is Python. Yes, His Teochewness has been tinkering yet again with forces beyond his ken, and a whole ton of pesky bugs is the result! Well, I exaggerate - just the one.

Python is a good language for data analysis, and I had data to analyze. Specifically, COVID-19 data. Since April of last year, I've been collecting data from the daily updates sent to me by the Singapore Government in the hopes of being able to derive some meaningful insight as to the trends regarding this disease.

Using the Python Pandas library, I managed to read the data for April 2020 into a dataframe.
import pandas as pd

df = pd.read_csv('covid_apr2020.csv')
df.set_index('RowNum', inplace = True)
df


This is what it looked like.



Next, I tried plotting a quick line chart to compare the number of migrant worker infections ("MW") against the number of Singapore citizen infections ("SG").
import pandas as pd

df = pd.read_csv('covid_apr2020.csv')
df.set_index('RowNum', inplace = True)
df

df.plot.line('Day', ['MW', 'SG'])



And there it was.

However, I wanted a small subset of that data. I wanted to take data from every seven days instead of every day. So this is what I did - I created a new dataframe, df_weekly, by taking only rows 7, 14, 21 and 28. Here, I also commented out the plotting function in order to see what kind of data I would get.
import pandas as pd

df = pd.read_csv('covid_apr2020.csv')
df.set_index('RowNum', inplace = True)
df
df_weekly = df.iloc[[7, 14, 21, 28]]
df_weekly


#df.plot.line('Day', ['MW', 'SG'])


What went wrong

This is the data I got... and it certainly wasn't what I wanted. Why was the data starting from 8th April instead of 7th April? I was getting data from 8th, 15th, 22nd and 29th instead.

Why it went wrong

You see, in programming, counts usually start from 0. The iloc indexer selects rows purely by position, so pandas was treating 7, 14, 21 and 28 as zero-based positions rather than as values in my index. While my row numbers (the RowNum column) started nicely from 1, as far as pandas was concerned, position 7 was actually the 8th row in the dataframe. Position 14 was actually the 15th row. And so on.
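To see the off-by-one in isolation, here is a minimal sketch. It does not use my actual CSV; the dataframe below is a made-up stand-in with 30 rows, a RowNum index running from 1 to 30, and a Day column, which is all that's needed to reproduce the behaviour.

```python
import pandas as pd

# Hypothetical stand-in for the April 2020 data: 30 rows, RowNum 1..30.
df = pd.DataFrame({
    'RowNum': range(1, 31),
    'Day': [f'{d} Apr' for d in range(1, 31)],
})
df.set_index('RowNum', inplace=True)

# iloc counts positions from 0, so position 7 is the 8th row.
by_position = df.iloc[[7, 14, 21, 28]]

days_by_position = by_position['Day'].tolist()
labels_by_position = by_position.index.tolist()

print(days_by_position)    # ['8 Apr', '15 Apr', '22 Apr', '29 Apr']
print(labels_by_position)  # [8, 15, 22, 29]
```

The RowNum labels that come back are 8, 15, 22 and 29, which is exactly the one-day shift I was seeing.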

How I fixed it

Instead of using iloc, I used loc. Now pandas would treat those numbers as labels in the index of the dataframe (the "RowNum" column).
import pandas as pd

df = pd.read_csv('covid_apr2020.csv')
df.set_index('RowNum', inplace = True)
df
df_weekly = df.loc[[7, 14, 21, 28]]
df_weekly

#df.plot.line('Day', ['MW', 'SG'])


Now the data looked correct.
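For what it's worth, switching to loc is not the only way out. If you'd rather stay with positions, you can shift each number down by one, or use a slice that starts at position 6 and steps by 7. The sketch below checks that all three give the same rows. Again, the dataframe is a made-up 30-row stand-in for my CSV, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the April 2020 data: 30 rows, RowNum 1..30.
df = pd.DataFrame({
    'RowNum': range(1, 31),
    'Day': [f'{d} Apr' for d in range(1, 31)],
})
df = df.set_index('RowNum')

# Label-based: rows whose RowNum is 7, 14, 21 and 28.
by_label = df.loc[[7, 14, 21, 28]]

# Position-based: the same rows sit at zero-based positions 6, 13, 20 and 27.
by_shifted_position = df.iloc[[6, 13, 20, 27]]

# Position-based slice: start at position 6, then every 7th row after that.
by_slice = df.iloc[6::7]

print(by_label.index.tolist())             # [7, 14, 21, 28]
print(by_shifted_position.index.tolist())  # [7, 14, 21, 28]
print(by_slice.index.tolist())             # [7, 14, 21, 28]
```

I still prefer loc here, because the intent ("give me RowNum 7") is stated directly and keeps working even if the rows get reordered.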


And to test my data, I reactivated the last line and changed it to plot df_weekly instead.
import pandas as pd

df = pd.read_csv('covid_apr2020.csv')
df.set_index('RowNum', inplace = True)
df
df_weekly = df.loc[[7, 14, 21, 28]]
df_weekly

df_weekly.plot.line('Day', ['MW', 'SG'])


The final chart!


Moral of the story

Some functions look similar but work very differently. This isn't the first time the confusion between loc and iloc has caught out a novice Python programmer, and it won't be the last.

Remember, the "i" in iloc stands for integer, as in integer position. So when we use iloc, pandas picks rows by their zero-based position in the dataframe, whereas loc picks rows by their label in the index, whatever type that label happens to be.
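A tiny sketch makes the label-versus-position split easier to see when the index isn't numeric at all. The dataframe and its names (letters, value) are invented purely for illustration.

```python
import pandas as pd

# Hypothetical dataframe with string labels as the index.
letters = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# loc looks up the label 'c'; iloc looks up zero-based position 2.
third_by_label = letters.loc['c', 'value']
third_by_position = letters.iloc[2, 0]

# Slices differ too: loc includes the end label, iloc excludes the end position.
label_slice = letters.loc['b':'c']   # rows 'b' and 'c'
position_slice = letters.iloc[1:2]   # row 'b' only

print(third_by_label, third_by_position)      # 30 30
print(len(label_slice), len(position_slice))  # 2 1
```

In my case the labels happened to be integers starting from 1, which is precisely why the two were so easy to mix up.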

I've got my i on you!
T___T
