Data processing, and by association pre-processing, is critical if you intend to train a machine learning model. What that entails depends largely on the nature of the model, the task it’s being trained to accomplish, and the data set it’s being trained on. That said, one of the first steps in pre-processing usually involves making the data’s types uniform.
In a perfect world, all training data would contain the same data type. Unfortunately, we do not live in a perfect world. The majority of the data sets on which we train our machine learning models contain a mix of categorical and numerical data.
This is a non-issue for some of the more sophisticated machine learning models. They’re able to ingest and work with categorical data just as readily as anything else. The rest, however, simply aren’t designed to work with string labels, requiring all inputs and labels to be numeric.
Here’s where feature engineering comes in – specifically, one hot encoding.
What Is One Hot Encoding?
One hot encoding is one of several data-wrangling methods for converting categorical data into numerical data. Rather than simply assigning a numerical label to each categorical variable, one hot encoding creates a new binary column for each category. Each observation is then represented as a binary vector with a 1 in the column matching its category and a 0 everywhere else.
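In its simplest form, the idea can be sketched in a few lines of plain Python. The `one_hot` helper and the color categories below are illustrative, not part of any particular library:

```python
def one_hot(category, categories):
    """Return a binary vector with a 1 at the category's index and 0 elsewhere."""
    vector = [0] * len(categories)
    vector[categories.index(category)] = 1
    return vector

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]
```

Every category gets the same vector length, and exactly one position is "hot" (set to 1) per observation.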
For example, say you’re preparing a data set that lists fabric by price.
Your first step in pre-processing this data is to convert each category – polyester, cotton, charmeuse, and silk – into numerical values, a process known as integer encoding. Since this data set is relatively straightforward, we can simply assign them values of 1 through 4: polyester = 1, cotton = 2, charmeuse = 3, and silk = 4.
Once we’ve encoded our original categorical variables to integers, we can apply one hot encoding to those integers. We divide each category into its own column, then start by assigning a value of 1 to polyester and 0 to the other fabrics. We then repeat that process for cotton, charmeuse, and silk.
The result is a binary vector for each fabric. Since we have 4 categories, each vector has a length of 4. The fabrics can then be represented as follows:
- Polyester: [1,0,0,0]
- Cotton: [0,1,0,0]
- Charmeuse: [0,0,1,0]
- Silk: [0,0,0,1]
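In practice, you rarely build these columns by hand. One common approach (a sketch, not the only option) is pandas’ `get_dummies`, which creates one binary column per category, named `fabric_<category>` by default:

```python
import pandas as pd

# One observation per fabric, mirroring the example above.
df = pd.DataFrame({"fabric": ["polyester", "cotton", "charmeuse", "silk"]})

# One hot encode the fabric column; each row gets exactly one 1.
encoded = pd.get_dummies(df, columns=["fabric"], dtype=int)
print(encoded)
```

Note that pandas orders the new columns alphabetically, so the column order may differ from the vector order listed above; what matters is that each row has exactly one 1.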
It’s incredibly important to remain consistent when using and applying these values – otherwise, it may be difficult or impossible to restore the original categorical data later on.
Why Use One Hot Encoding?
Most machine learning models treat numerical order as meaningful: given integer labels, they tend to infer that higher values outrank lower ones. For some data sets, that’s a non-issue, and there are even edge cases where this tendency can be valuable.
The problem arises when categorical values are unranked and unrelated to one another. Allowing a model to treat such categories as ordinal can significantly impact performance and predictions. One hot encoding is useful because it eliminates this tendency, removing numerical significance and ordinal relationships from your categorical variables.
For instance, say you have a data set that includes the names of several electronics manufacturers – LG, Samsung, and Philips – mapped to integers 1, 2, and 3. Fed this raw label encoding, a machine learning model might assign greater weight to Philips simply because its label is the largest number.
It can get even more absurd. A model configured to calculate an arithmetic mean could confidently conclude that the average of LG, Samsung, and Philips is Samsung. Suffice it to say, this is something you probably want to avoid.
Benefits and Drawbacks of Using One Hot Encoding in Machine Learning
Like any machine learning technique, one hot encoding has both strengths and weaknesses.
Its first and most obvious benefit is that it allows categorical variables to be used in numerical-only models. Done right, this has the potential to considerably improve performance by providing the model with more information about each variable. As already mentioned, one hot encoding also eliminates ordinality from categorical variables.
Unfortunately, this technique can also lead to increased dimensionality. It introduces additional complexity into the model and training process by creating a separate column for each category. This has the potential to considerably slow down training.
One hot encoding also tends to produce sparse data, since each encoded column assigns most observations a value of 0. And applying it to many categories in a large data set with relatively few samples can invite overfitting. In short, while one hot encoding can be invaluable when treating categorical data, it’s not suitable for every situation.
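The sparsity problem scales directly with the number of categories. As a quick sketch, assuming a hypothetical feature with 100 categories, each one hot encoded row is 99% zeros:

```python
# A hypothetical feature with 100 distinct categories.
categories = [f"cat_{i}" for i in range(100)]

# One hot encode a single observation: one 1, ninety-nine 0s.
row = [0] * len(categories)
row[categories.index("cat_42")] = 1

sparsity = row.count(0) / len(row)
print(sparsity)  # 0.99
```

With high-cardinality features, most of the encoded matrix is zeros, which is why many libraries offer sparse matrix output for one hot encoded data.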
That’s the bad news. The good news is that even if you apply one hot encoding to a data set for which it’s ill-suited, you can still potentially correct your error through other techniques like dimensionality reduction or feature selection.