I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data.
2) If some percentage of the values in the column is unique (e.g., >= 20%), then the column very likely contains continuous data.
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
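For reference, heuristic 1) is straightforward to express with pandas dtypes; here is a rough sketch (the DataFrame and column names are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q"],     # strings -> object dtype
    "fare": [7.25, 71.28, 8.05, 8.46],    # floats  -> numeric dtype
})

# Heuristic 1): treat object (string) columns as categorical
likely_cat = {col: pd.api.types.is_object_dtype(df[col]) for col in df.columns}
# {'embarked': True, 'fare': False}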
Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column contained continuous values but had only a few unique values, and the heuristic in 2) obviously failed there. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set, leading to the same kind of misclassification.
Step 1: Read the problem and identify the variables described. Note key properties of the variables, such as what types of values they can take.
Step 2: Identify any variables from Step 1 that take on values from a limited number of possible values with no particular ordering. These variables are categorical.
The key distinction is that continuous variables have an infinite number of possible values between any two values a and b, while categorical variables don't. A great example of a continuous variable is your body weight in kilograms.
Here are a couple of approaches:
Find the ratio of the number of unique values to the total number of values. Something like the following:
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05  # or some other threshold
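As a quick, made-up illustration of how the ratio test behaves (the column contents and the 0.05 threshold are arbitrary):

import pandas as pd

demo = pd.DataFrame({
    "pclass": [1, 3, 3, 2] * 250,            # 3 distinct values in 1000 rows
    "fare": pd.Series(range(1000)) / 7.0,     # 1000 distinct values
})
likely_cat = {var: 1. * demo[var].nunique() / demo[var].count() < 0.05 for var in demo.columns}
# {'pclass': True, 'fare': False}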
Check if the top n unique values account for more than a certain proportion of all values
top_n = 10
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1. * df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
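Either way, the resulting likely_cat dict feeds directly into the original goal of encoding only the categorical columns. A minimal sketch using scikit-learn (which the question's mention of OneHotEncoder suggests is already in use; the handle_unknown setting is just one reasonable choice):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = [var for var, is_cat in likely_cat.items() if is_cat]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",   # leave the continuous columns as they are
)
X = preprocess.fit_transform(df)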
Approach 1) has generally worked better for me than Approach 2). Approach 2), however, works better when a column has a 'long-tailed' distribution, where a small number of categories occur very frequently while a large number of categories occur only rarely.
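To make the long-tail case concrete, here is a synthetic column where Approach 1) misses the categorical nature but Approach 2) catches it (the row counts and category counts are chosen only to give the distribution the right shape):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 10 frequent categories cover ~90% of the rows; up to 200 rare ones cover the rest
frequent = rng.choice([f"cat_{i}" for i in range(10)], size=900)
rare = rng.choice([f"rare_{i}" for i in range(200)], size=100)
col = pd.Series(np.concatenate([frequent, rare]))

ratio = col.nunique() / col.count()                        # roughly 0.09, not < 0.05 -> Approach 1 says continuous
top_10 = col.value_counts(normalize=True).head(10).sum()   # about 0.9, > 0.8         -> Approach 2 says categorical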