Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.

Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?

My initial thoughts are:

1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data

2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data

I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?

Edit: Someone requested that I explain why 2) didn't work well. There were some tests cases where we still had continuous values in a column but there weren't many unique values in the column. The heuristic in 2) obviously failed in that case. There were also issues where we had a categorical column that had many, many unique values, e.g., passenger names in the Titanic data set. Same column type misclassification problem there.

like image 707
Randy Olson Avatar asked Mar 06 '16 12:03

Randy Olson


People also ask

How do you identify categorical variables?

Step 1: Read the problem and identify the variables described. Note key properties of the variables, such as what types of values the variables can take. Step 2: Identify any variables from step 1 that take on values from a limited number of possible values with no particular ordering. These variables are categorical.

How do you know if a variable is categorical or continuous in Python?

The key distinction is that continuous variables have an infinite number of values between any values a and b. Categorical variables don't! A great example is your body weight in kilograms.


1 Answers

Here are a couple of approaches:

  1. Find the ratio of number of unique values to the total number of unique values. Something like the following

    likely_cat = {} for var in df.columns:     likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05 #or some other threshold 
  2. Check if the top n unique values account for more than a certain proportion of all values

    top_n = 10  likely_cat = {} for var in df.columns:     likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8 #or some other threshold 

Approach 1) has generally worked better for me than Approach 2). But approach 2) is better if there is a 'long-tailed distribution', where a small number of categorical variables have high frequency while a large number of categorical variables have low frequency.

like image 188
Rishabh Srivastava Avatar answered Sep 22 '22 09:09

Rishabh Srivastava