Is there a faster way to drop columns that only contain one distinct value than the code below?
cols = df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist())) < 2:
        df = df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, it collects every distinct value in each column, when in fact it could just stop counting after reaching 2 different values.
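For example, the early-exit idea could be sketched in plain Python like this (has_second_value is just an illustrative helper, not a pandas API, and it ignores NaN handling):

def has_second_value(series):
    # Stop scanning as soon as a second distinct value shows up.
    it = iter(series)
    try:
        first = next(it)
    except StopIteration:
        return False  # empty column: no second value
    return any(x != first for x in it)

df = df[[c for c in df.columns if has_second_value(df[c])]]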
You can use the Series.unique() method to find all the unique elements in a column; for columns whose .unique() returns only 1 element, you can drop that column. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
A method that does not drop columns in place -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col,axis=1)
    return res
## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the one using unique() and looping through the columns.
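If you do not need the exact loop above, DataFrame.nunique() is a more compact alternative worth trying (it was not timed here, so treat it as a sketch); it returns a Series with the number of distinct values per column, and dropna=False makes an all-NaN column count as a single value:

df = df.loc[:, df.nunique(dropna=False) > 1]

The boolean mask produced by the comparison selects only the columns with more than one distinct value.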
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
Create a list of column names that have more than 1 distinct value.
keep = [c for c in list(df) if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c in list(df) if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
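As a quick illustration on a small made-up DataFrame (the column names here are arbitrary), both forms leave only the columns with more than one distinct value:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3], 'c': ['x', 'x', 'x']})
keep = [c for c in list(df) if len(df[c].unique()) > 1]
df = df[keep]
print(df.columns.tolist())  # ['b']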