
Quickly drop DataFrame columns with only one distinct value

Tags: python, pandas

Is there a faster way to drop columns that only contain one distinct value than the code below?

    cols = df.columns.tolist()
    for col in cols:
        if len(set(df[col].tolist())) < 2:
            df = df.drop(col, axis=1)

This is quite slow for large DataFrames. It counts every distinct value in each column, when it could stop as soon as it finds a second distinct value.

Alexis Eggermont, asked Oct 15 '15 09:10


2 Answers

You can use the Series.unique() method to get the distinct elements of a column, and drop every column for which .unique() returns only one element. Example:

    for col in df.columns:
        if len(df[col].unique()) == 1:
            df.drop(col, inplace=True, axis=1)

A version that does not modify df in place:

    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col, axis=1)

Demo:

    In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

    In [155]: for col in df.columns:
       .....:     if len(df[col].unique()) == 1:
       .....:         df.drop(col,inplace=True,axis=1)
       .....:

    In [156]: df
    Out[156]:
       1
    0  2
    1  3
    2  2

Timing results:

    In [166]: %paste
    def func1(df):
        res = df
        for col in df.columns:
            if len(df[col].unique()) == 1:
                res = res.drop(col, axis=1)
        return res
    ## -- End pasted text --

    In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

    In [178]: %timeit func1(df)
    1000 loops, best of 3: 1.05 ms per loop

    In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
    100 loops, best of 3: 8.81 ms per loop

    In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
    100 loops, best of 3: 5.81 ms per loop

In these timings, looping through the columns and calling unique() is still the fastest of the three approaches.
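Another option, not timed here, is to compare every row against the first row and keep only the columns where something differs. The following is a minimal sketch under that assumption; the helper name drop_constant_columns is made up for illustration and is not part of pandas. Note that NaN compares unequal to itself, so a column consisting only of NaN is kept by this check.

```python
import numpy as np
import pandas as pd


def drop_constant_columns(df):
    """Return a DataFrame without columns whose values all equal the first row.

    Hypothetical helper for illustration only. NaN != NaN, so an
    all-NaN column counts as "varying" and is kept.
    """
    if df.empty:
        # No rows (or no columns) to compare against; return a copy unchanged.
        return df.copy()
    # Boolean Series indexed by column name: True where any row differs
    # from the first row in that column.
    varies = (df != df.iloc[0]).any()
    return df.loc[:, varies]


# Same sample frame as in the timing session.
df = pd.DataFrame({'a': 1, 'b': np.arange(5), 'c': [0, 0, 2, 2, 2]})
result = drop_constant_columns(df)
```

Here result keeps columns b and c and drops the constant column a. Whether this beats the unique() loop will depend on the data, so it is worth timing on your own frame.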

Anand S Kumar, answered Sep 19 '22 21:09


One step:

    df = df[[c for c
             in list(df)
             if len(df[c].unique()) > 1]]

Two steps:

Create a list of the column names that have more than one distinct value:

    keep = [c for c
            in list(df)
            if len(df[c].unique()) > 1]

Drop the columns that are not in 'keep':

    df = df[keep]

Note: this step can also be done using a list of columns to drop:

    drop_cols = [c for c
                 in list(df)
                 if df[c].nunique() <= 1]
    df = df.drop(columns=drop_cols)
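
One caveat when switching between unique() and nunique(): the two treat missing values differently. unique() returns NaN as one of the distinct values, while nunique() ignores NaN unless dropna=False is passed. A small sketch with a made-up frame shows the effect:

```python
import numpy as np
import pandas as pd

# Illustrative frame: "x" holds one real value plus a NaN, "y" varies.
df = pd.DataFrame({"x": [1.0, np.nan, 1.0], "y": [1, 2, 3]})

n_unique = len(df["x"].unique())            # NaN is counted as a value
n_nunique = df["x"].nunique()               # NaN is ignored by default
n_nunique_nan = df["x"].nunique(dropna=False)  # NaN is counted again

# The two filters therefore disagree on column "x".
keep_unique = [c for c in list(df) if len(df[c].unique()) > 1]
keep_nunique = [c for c in list(df) if df[c].nunique() > 1]
```

With this frame, the unique() filter keeps both x and y, while the nunique() filter keeps only y. Passing dropna=False to nunique() makes the two agree.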

kait, answered Sep 21 '22 21:09