Tuesday, 12 December 2023

Spot The Bug: The Baby Name Dataframe Misgendering

It's a whole new episode of Spot The Bug. Time to let these bugs know who's boss!

In this episode, we will deal with some mishaps I encountered while screwing around with Python.

Just look at all these bugs...

First off, there was a dataset of baby names that I copied from https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv.
"year","name","percent","sex"
1880,"John",0.081541,"boy"
1880,"William",0.080511,"boy"
1880,"James",0.050057,"boy"
1880,"Charles",0.045167,"boy"
1880,"George",0.043292,"boy"
1880,"Frank",0.02738,"boy"
1880,"Joseph",0.022229,"boy"
1880,"Thomas",0.021401,"boy"
1880,"Henry",0.020641,"boy"
.........
.........
.........

This CSV was loaded into Python and then fiddled with. The idea was to make two copies, do whatever I needed with those copies, and then go back to the original dataset. Unfortunately, things did not quite turn out that way. In the first copy, data_boy, for the sake of neatness, I changed the sex column's values from "boy" to "M". Then I turned it into a subset of itself where sex was "M". In the second copy, data_girl, I did the same thing, changing "girl" to "F" and only retaining rows where the value was "F".
import pandas as pd

data = pd.read_csv("babynames.csv")

data_boy = data
data_girl = data
data_boy.replace(to_replace = "boy", value = "M", inplace=True )
data_boy = data_boy[(data_boy.sex == "M")]
data_girl.replace(to_replace = "girl", value = "F", inplace=True )
data_girl = data_girl[(data_girl.sex == "F")]

So far so good.
data_boy.head()


data_girl.head()


Then I noticed that there were boys named "Flora" in the dataset.
.........
.........
1880,"Erasmus",4.2e-05,"boy"
1880,"Esau",4.2e-05,"boy"
1880,"Everette",4.2e-05,"boy"
1880,"Firman",4.2e-05,"boy"
1880,"Fleming",4.2e-05,"boy"
1880,"Flora",4.2e-05,"boy"
1880,"Gardner",4.2e-05,"boy"
1880,"Gee",4.2e-05,"boy"
.........
.........

I decided to do a little filtering in the original dataframe, data, to see how many such rows there were.
data_flora = data
data_flora = data_flora.query('name == "Flora" & sex == "boy"')

But there were no rows!
data_flora.head()

What went wrong

When I made the first copy, data_boy, and replaced "boy" with "M" in place, the original dataframe, data, mirrored that change. Same for data_girl and "F". Thus when I made the third copy from the original, its sex column only had "M" and "F", not "boy" and "girl", so a query for "boy" matched nothing!

Why it went wrong

It was really simple, as it turned out. I had used the assignment operator (=) to copy the dataframe data, erroneously thinking that this was how it was done. Instead, it made no copy at all; all it did was give the same dataframe a second name. Any in-place change made through data_boy or data_girl, such as replace() with inplace=True, was really a change to data.
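
To see the difference for yourself, here is a minimal sketch. The three-row dataframe is a made-up stand-in for the baby names data, and the names alias and independent are purely illustrative.

```python
import pandas as pd

# Made-up stand-in for the baby names data (illustration only).
data = pd.DataFrame({
    "name": ["John", "Flora", "Mary"],
    "sex": ["boy", "boy", "girl"],
})

alias = data               # no copy: a second name for the same object
independent = data.copy()  # a separate dataframe with its own data

# An in-place change through the alias is a change to data itself.
alias.replace(to_replace="boy", value="M", inplace=True)

same_object = alias is data
data_sex = data["sex"].tolist()
independent_sex = independent["sex"].tolist()

print(same_object)      # True
print(data_sex)         # ['M', 'M', 'girl']
print(independent_sex)  # ['boy', 'boy', 'girl']
```

The is check is the quickest way to tell whether two names point at the same dataframe.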

How I fixed it

What I should have done was use the copy() method.
data_boy = data.copy()
data_girl = data.copy()
data_boy.replace(to_replace = "boy", value = "M", inplace=True )
data_boy = data_boy[(data_boy.sex == "M")]
data_girl.replace(to_replace = "girl", value = "F", inplace=True )
data_girl = data_girl[(data_girl.sex == "F")]

data_flora = data.copy()
data_flora = data_flora.query('name == "Flora" & sex == "boy"')

And there it was!
data_flora.head()
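
For what it's worth, another way to stay out of trouble is to skip inplace=True altogether. This is only a sketch of an alternative, not what I actually ran; the four-row dataframe is made up for illustration. Filtering with a boolean mask and calling replace() without inplace both return new dataframes, so data is never touched.

```python
import pandas as pd

# Made-up stand-in for the baby names data (illustration only).
data = pd.DataFrame({
    "year": [1880, 1880, 1880, 1880],
    "name": ["John", "Flora", "Mary", "Flora"],
    "sex": ["boy", "boy", "girl", "girl"],
})

# Filter first, then relabel. Without inplace=True, replace() returns
# a new dataframe and leaves its input alone.
data_boy = data[data["sex"] == "boy"].replace({"sex": {"boy": "M"}})
data_girl = data[data["sex"] == "girl"].replace({"sex": {"girl": "F"}})

# The original still says "boy" and "girl", so the query finds its row.
data_flora = data.query('name == "Flora" & sex == "boy"')

print(data_boy["sex"].tolist())   # ['M', 'M']
print(data_girl["sex"].tolist())  # ['F', 'F']
print(len(data_flora))            # 1
```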

Moral of the story

Just because it works a certain way in other programming languages does not mean jack. Well, not always, anyway. Some programming languages do things a certain way out of necessity.

Plain assignment with the equals sign is perfectly valid if you want two names for the same dataframe, so that in-place changes made through either one show up in both. But if you want an independent copy, you need to use the copy() method.

copy() that?
T___T
