Saturday, 19 August 2023

The How And Why Of One-Hot Encoding

Categorical data in a dataset are labels, usually in the form of strings. As I've mentioned in a previous blogpost, they can also be numbers, but even then, they are numbers with no numerical relation to each other.

This normally does not present a problem; however, when categorical data needs to be fed into a Machine Learning model, it needs to be converted into numeric data.

Can't ingest letters...

The reason for this is that Machine Learning models currently do not work very well, or at all, with non-numeric data. To these models, the numbers need to have some sort of numerical relation to each other (as is the case with quantities, dates and such), or consist of 1s and 0s. When using such Machine Learning models, One-Hot Encoding (OHE) can be a boon as it allows you to keep categorical data in your dataset.

How it works

Let's first use a sample column called Gender. Here's the column and its data.

Gender
M
M
F
M
M
F
F


For categorical data, we split that particular column into a number of columns equal to the number of possible values for that column. Thus, for a column like Gender, you might have two columns. Unless, of course, it's somewhere like Canada, in which case you might have more. Don't ask me why; it's how the world rolls right now.

The new columns are named after the original column and the data value, so we might have Gender_M and Gender_F. We sort the new columns alphabetically, so Gender_F comes before Gender_M. And then we convert the data into 1s and 0s, 1 being true and 0 being false. Thus, if the data is "M", it will show up as 1 under Gender_M and 0 under Gender_F.

Gender_F Gender_M
0 1
0 1
1 0
0 1
0 1
1 0
1 0
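The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how any particular library does it: the function name one_hot is made up for this post, and in practice a library such as pandas (with its get_dummies function) would normally do the job for you.

```python
def one_hot(column_name, values):
    """Return {new_column_name: [0 or 1, ...]} for one categorical column.

    Hypothetical helper written for illustration only.
    """
    # One new column per distinct value, sorted alphabetically.
    categories = sorted(set(values))
    return {
        f"{column_name}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


gender = ["M", "M", "F", "M", "M", "F", "F"]
encoded = one_hot("Gender", gender)

for name, flags in encoded.items():
    print(name, flags)
# Gender_F [0, 0, 1, 0, 0, 1, 1]
# Gender_M [1, 1, 0, 1, 1, 0, 0]
```

Reading each list from left to right gives the same 1s and 0s as the table above, one column at a time.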


For a column with even more possible values, let's think of something like Shirt_Size. We'll place it next to the previously converted values for better clarity.
Gender_F Gender_M Shirt_Size
0 1 L
0 1 L
1 0 M
0 1 S
0 1 XL
1 0 L
1 0 XS


Here is the conversion.
Gender_F Gender_M Shirt_Size_L Shirt_Size_M Shirt_Size_S Shirt_Size_XL Shirt_Size_XS
0 1 1 0 0 0 0
0 1 1 0 0 0 0
1 0 0 1 0 0 0
0 1 0 0 1 0 0
0 1 0 0 0 1 0
1 0 1 0 0 0 0
1 0 0 0 0 0 1
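For anyone who wants to reproduce the combined conversion, here is a rough sketch that encodes both columns in one go. It assumes the table is held as a plain dictionary of column name to list of values; one_hot_table is a made-up name for this post, not a library function.

```python
def one_hot_table(table):
    """One-hot encode every column of a {column_name: [values]} table.

    Hypothetical helper written for illustration only.
    """
    encoded = {}
    for column_name, values in table.items():
        # One new column per distinct value, sorted alphabetically.
        for category in sorted(set(values)):
            encoded[f"{column_name}_{category}"] = [int(v == category) for v in values]
    return encoded


table = {
    "Gender": ["M", "M", "F", "M", "M", "F", "F"],
    "Shirt_Size": ["L", "L", "M", "S", "XL", "L", "XS"],
}
encoded = one_hot_table(table)

header = list(encoded)
rows = list(zip(*encoded.values()))  # turn columns back into rows

print(" ".join(header))
for row in rows:
    print(" ".join(str(flag) for flag in row))
```

Every printed row should contain exactly two 1s: one for the Gender columns and one for the Shirt_Size columns.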


Why not One-Hot Encoding

Yes, OHE, like all solutions, is not ideal for every situation. These are some of the less desirable outcomes it can produce.

If there are many possible values for a column (such as a color code), this can lead to an unreasonably large number of columns. Basically, too many predictors (or features) in the dataset.

Too many columns.

And since, among the columns generated from one original column, only one holds a 1 in any given row while all the others hold 0, this leads to sparse data.

All this can lead to increased difficulty in training the Machine Learning model.
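To put rough numbers on this, here is a small sketch that works out how many columns OHE would generate for a single column, and what fraction of the resulting cells would actually hold a 1. The color codes and row count are invented purely for illustration, and one_hot_shape_and_density is a made-up name.

```python
def one_hot_shape_and_density(values):
    """Return (rows, generated columns, fraction of cells that are 1).

    Hypothetical helper written for illustration only.
    """
    n_rows = len(values)
    n_columns = len(set(values))  # one generated column per distinct value
    # Each row has exactly one 1 among the generated columns.
    density = n_rows / (n_rows * n_columns)
    return n_rows, n_columns, density


# Made-up data: 2000 rows drawn from 500 distinct color codes.
color_codes = [f"#{i:06X}" for i in range(500)]
column = [color_codes[i % 500] for i in range(2000)]

n_rows, n_columns, density = one_hot_shape_and_density(column)
print(n_rows, n_columns, density)
# 2000 500 0.002
```

In this invented case a single column turns into 500, and only 0.2% of the generated cells carry a 1; the rest are all 0s.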

Another issue, related to the previous points, is that OHE produces additional columns (and thus increased space) without actually adding to the value of the data. The data still means the same thing; it only takes up more space now. Kind of like if instead of this nice, short, to-the-point blogpost, I encapsulated the same points in a five-page article.
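One way to see that nothing new has been added is to run the conversion backwards. The sketch below takes the Gender_F and Gender_M columns from earlier and recovers the original Gender column from them, then compares the number of cells before and after. The function name one_hot_decode is made up for this post.

```python
def one_hot_decode(column_name, encoded):
    """Rebuild the original column from its one-hot columns.

    Hypothetical helper written for illustration only.
    """
    prefix = column_name + "_"
    n_rows = len(next(iter(encoded.values())))
    decoded = []
    for i in range(n_rows):
        # Find the single generated column holding a 1 in this row.
        hot = [
            name[len(prefix):]
            for name, flags in encoded.items()
            if name.startswith(prefix) and flags[i] == 1
        ]
        decoded.append(hot[0])
    return decoded


encoded = {
    "Gender_F": [0, 0, 1, 0, 0, 1, 1],
    "Gender_M": [1, 1, 0, 1, 1, 0, 0],
}
original = one_hot_decode("Gender", encoded)
print(original)
# ['M', 'M', 'F', 'M', 'M', 'F', 'F']

cells_before = len(original)
cells_after = sum(len(flags) for flags in encoded.values())
print(cells_before, cells_after)
# 7 14
```

Same information, twice the cells, and that is with only two possible values.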

Conclusion

OHE is not the only method of preparing data for a Machine Learning model. There are others, of course, and perhaps on another occasion I can present them. Till then...

Have a One-derful day,
T___T
