Suppose I've data similar to following:
index id name value value2 value3 data1 val5 0 345 name1 1 99 23 3 66 1 12 name2 1 99 23 2 66 5 2 name6 1 99 23 7 66
How can we drop all those columns like (value
, value2
, value3
) where all rows have the same values, in one command or couple of commands using python?
Consider we have many columns similar to value
, value2
, value3
...value200
.
Output:
index id name data1 0 345 name1 3 1 12 name2 2 5 2 name6 7
To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.
Use drop() method to delete rows based on column value in pandas DataFrame, as part of the data cleansing, you would be required to drop rows from the DataFrame when a column value matches with a static value or on another column value.
The Spark DataFrame provides the drop() method to drop the column or the field from the DataFrame or the Dataset. The drop() method is also used to remove the multiple columns from the Spark DataFrame or the Database.
What we can do is use nunique
to calculate the number of unique values in each column of the dataframe, and drop the columns which only have a single unique value:
In [285]: nunique = df.nunique() cols_to_drop = nunique[nunique == 1].index df.drop(cols_to_drop, axis=1) Out[285]: index id name data1 0 0 345 name1 3 1 1 12 name2 2 2 5 2 name6 7
Another way is to just diff
the numeric columns, take abs
values and sums
them:
In [298]: cols = df.select_dtypes([np.number]).columns diff = df[cols].diff().abs().sum() df.drop(diff[diff== 0].index, axis=1) Out[298]: index id name data1 0 0 345 name1 3 1 1 12 name2 2 2 5 2 name6 7
Another approach is to use the property that the standard deviation will be zero for a column with the same value:
In [300]: cols = df.select_dtypes([np.number]).columns std = df[cols].std() cols_to_drop = std[std==0].index df.drop(cols_to_drop, axis=1) Out[300]: index id name data1 0 0 345 name1 3 1 1 12 name2 2 2 5 2 name6 7
Actually the above can be done in a one-liner:
In [306]: df.drop(df.std()[(df.std() == 0)].index, axis=1) Out[306]: index id name data1 0 0 345 name1 3 1 1 12 name2 2 2 5 2 name6 7
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With