I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data.
2) If some percentage of the values in the column is unique (e.g., >= 20%), then the column very likely contains continuous data.
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
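For reference, heuristic 1) is straightforward to express with pandas dtypes; here is a rough sketch (the DataFrame and column names are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q"],     # strings -> object dtype
    "fare": [7.25, 71.28, 8.05, 8.46],    # floats  -> numeric dtype
})

# Heuristic 1): treat object (string) columns as categorical
likely_cat = {col: pd.api.types.is_object_dtype(df[col]) for col in df.columns}
# {'embarked': True, 'fare': False}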
Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where a column contained continuous values but had only a few unique values, and the heuristic in 2) obviously failed there. There were also cases where a categorical column had many, many unique values, e.g., passenger names in the Titanic data set, leading to the same kind of misclassification.
Step 1: Read the problem and identify the variables described. Note key properties of the variables, such as what types of values they can take.
Step 2: Identify any variables from Step 1 that take on values from a limited number of possible values with no particular ordering. These variables are categorical.
The key distinction is that continuous variables have an infinite number of possible values between any two values a and b, while categorical variables don't. A great example of a continuous variable is your body weight in kilograms.
Here are a couple of approaches:
Find the ratio of the number of unique values to the total number of values. Something like the following:
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05  # or some other threshold
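As a quick, made-up illustration of how the ratio test behaves (the column contents and the 0.05 threshold are arbitrary):

import pandas as pd

demo = pd.DataFrame({
    "pclass": [1, 3, 3, 2] * 250,            # 3 distinct values in 1000 rows
    "fare": pd.Series(range(1000)) / 7.0,     # 1000 distinct values
})
likely_cat = {var: 1. * demo[var].nunique() / demo[var].count() < 0.05 for var in demo.columns}
# {'pclass': True, 'fare': False}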
Check if the top n unique values account for more than a certain proportion of all values
top_n = 10
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1. * df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
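Either way, the resulting likely_cat dict feeds directly into the original goal of encoding only the categorical columns. A minimal sketch using scikit-learn (which the question's mention of OneHotEncoder suggests is already in use; the handle_unknown setting is just one reasonable choice):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = [var for var, is_cat in likely_cat.items() if is_cat]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",   # leave the continuous columns as they are
)
X = preprocess.fit_transform(df)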
Approach 1) has generally worked better for me than Approach 2). Approach 2), however, works better when a column has a 'long-tailed' distribution, where a small number of categories occur very frequently while a large number of categories occur only rarely.
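To make the long-tail case concrete, here is a synthetic column where Approach 1) misses the categorical nature but Approach 2) catches it (the row counts and category counts are chosen only to give the distribution the right shape):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 10 frequent categories cover ~90% of the rows; up to 200 rare ones cover the rest
frequent = rng.choice([f"cat_{i}" for i in range(10)], size=900)
rare = rng.choice([f"rare_{i}" for i in range(200)], size=100)
col = pd.Series(np.concatenate([frequent, rare]))

ratio = col.nunique() / col.count()                        # roughly 0.09, not < 0.05 -> Approach 1 says continuous
top_10 = col.value_counts(normalize=True).head(10).sum()   # about 0.9, > 0.8         -> Approach 2 says categorical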