quickly drop dataframe columns with only one distinct value

Tags: python, pandas

Is there a faster way to drop columns that only contain one distinct value than the code below?

cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)

This is really quite slow for large dataframes. Logically, it builds the full set of distinct values in each column, when it could stop as soon as it has seen two different values.
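The early-exit idea described above could be sketched roughly as follows. This is only an illustration of the logic, not a claim that it is faster: `has_at_least_two_values` is a hypothetical helper, the sample frame is made up, and a pure-Python loop may well lose to vectorized pandas calls on real data.

```python
import pandas as pd

def has_at_least_two_values(series):
    # Hypothetical helper: stop scanning at the first value that differs
    # from the first element, instead of collecting every distinct value.
    values = iter(series)
    try:
        first = next(values)
    except StopIteration:
        return False  # an empty column has no distinct values
    # any() short-circuits on the first differing value.
    # Caveat: NaN != NaN is True, so NaN-only columns count as "varying" here.
    return any(value != first for value in values)

# Made-up example data.
df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": ["x", "x", "y"]})
keep = [col for col in df.columns if has_at_least_two_values(df[col])]
df = df[keep]
```

Whether the early exit pays off depends on how soon a second value shows up: for truly constant columns the whole column still has to be scanned.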

asked Oct 15 '15 09:10

Alexis Eggermont

2 Answers

You can use the Series.unique() method to find all the unique elements in a column, and drop any column for which .unique() returns only 1 element. Example -

for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)

A version that does not drop in place -

res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)

Demo -

In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2

Timing results -

In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col,axis=1)
    return res

## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop

On this small 5-row frame, the fastest method still seems to be looping through the columns and using unique.
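To repeat the comparison on a larger frame, something like the sketch below might help. It is not part of the original timings: `drop_with_nunique` is an extra candidate based on `DataFrame.nunique`, and the frame size and repeat count are arbitrary assumptions. No results are claimed here; the numbers depend on the data and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def drop_with_unique_loop(df):
    # Same logic as func1 above: loop over the columns and drop
    # those with exactly one unique value.
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col, axis=1)
    return res

def drop_with_nunique(df):
    # Alternative candidate (an assumption, not from the timings above):
    # count distinct values per column in a single call.
    # dropna=False counts NaN as a value, as unique() does.
    return df.loc[:, df.nunique(dropna=False) > 1]

# Hypothetical benchmark frame: 10,000 rows, one constant column.
df = pd.DataFrame({'a': 1, 'b': np.arange(10_000), 'c': np.arange(10_000) % 3})

# Both candidates should agree before their speed is compared.
assert drop_with_unique_loop(df).equals(drop_with_nunique(df))

repeats = 20
for func in (drop_with_unique_loop, drop_with_nunique):
    seconds = timeit.timeit(lambda: func(df), number=repeats)
    print(f"{func.__name__}: {seconds / repeats * 1e3:.3f} ms per call")
```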

answered Sep 19 '22 21:09

Anand S Kumar

One step:

df = df[[c for c
         in list(df)
         if len(df[c].unique()) > 1]]

Two steps:

Create a list of column names that have more than 1 distinct value.

keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

Drop the columns that are not in 'keep':

df = df[keep]

Note: this step can also be done using a list of columns to drop:

drop_cols = [c for c
             in list(df)
             if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
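One caveat worth checking if the data contains missing values: the unique() and nunique() variants above may not select the same columns, because unique() includes NaN in its result while nunique() ignores NaN by default. A small sketch, using a made-up frame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Made-up data: the column names are only for illustration.
df = pd.DataFrame({
    'const': [1, 1, 1],
    'with_nan': [1.0, np.nan, 1.0],
    'all_nan': [np.nan, np.nan, np.nan],
    'varied': [1, 2, 3],
})

# unique() includes NaN, so 'with_nan' has two unique entries and is kept.
keep = [c for c in list(df) if len(df[c].unique()) > 1]

# nunique() skips NaN by default, so 'with_nan' counts as one value and is dropped.
drop_cols = [c for c in list(df) if df[c].nunique() <= 1]

# Passing dropna=False makes nunique() count NaN, matching the unique() variant.
drop_cols_with_nan = [c for c in list(df) if df[c].nunique(dropna=False) <= 1]

print(keep)                # ['with_nan', 'varied']
print(drop_cols)           # ['const', 'with_nan', 'all_nan']
print(drop_cols_with_nan)  # ['const', 'all_nan']
```

Which behaviour is wanted depends on whether a column holding one value plus gaps should count as constant.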

answered Sep 21 '22 21:09

kait
