I have got a dataframe like this:
import pandas as pd
data = {
'c1': ['Test1','Test2','NULL','Test3',' ','Test4','Test4','Test1',"Test3"],
'c2': [' ','Test1',' ','NULL',' ','NULL','NULL','NULL','NULL'],
'c3': [0,0,0,0,0,1,5,0,0],
'c4': ['NULL', 'Test2', 'Test1','Test1', 'Test2', 'Test2','Test1','Test1','Test2']
}
df = pd.DataFrame(data)
df
The dataframe looks like this:
c1 c2 c3 c4
0 Test1 0 NULL
1 Test2 Test1 0 Test2
2 NULL 0 Test1
3 Test3 NULL 0 Test1
4 0 Test2
5 Test4 NULL 1 Test2
6 Test4 NULL 5 Test1
7 Test1 NULL 0 Test1
8 Test3 NULL 0 Test2
I want to drop all columns, that have more than 60 % of "empty" values. "Empty" means in my case that the values are for example: ' ', 'NULL' or 0. There are strings (c1, c2, c4) as well as integers (c3).
The result should be a dataframe with columns c1 and c4 only.
c1 c4
0 Test1 NULL
1 Test2 Test2
2 NULL Test1
3 Test3 Test1
4 Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
I have no idea how to handle that problem. Only thing that comes to my mind is something like
df.loc[:, (df != 0).any(axis=0)]
to delete all columns where all values are 0, 'NULL' and so on.
Use DataFrame.isin
for check all formats and then get mean
for treshold and filter by boolean indexing
with loc
:
print (df.isin([' ','NULL',0]))
c1 c2 c3 c4
0 False True True True
1 False False True False
2 True True True False
3 False True True False
4 True True True False
5 False True False False
6 False True False False
7 False True True False
8 False True True False
print (df.isin([' ','NULL',0]).mean())
c1 0.222222
c2 0.888889
c3 0.777778
c4 0.111111
dtype: float64
df = df.loc[:, df.isin([' ','NULL',0]).mean() < .6]
print (df)
c1 c4
0 Test1 NULL
1 Test2 Test2
2 NULL Test1
3 Test3 Test1
4 Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
you can drop the columns using dropna thresh
parameter:
In [58]: df = df.replace([0,' ','NULL'],np.nan)
In[59]: df
Out[59]:
c1 c2 c3 c4
0 Test1 NaN NaN NaN
1 Test2 Test1 NaN Test2
2 NaN NaN NaN Test1
3 Test3 NaN NaN Test1
4 NaN NaN NaN Test2
5 Test4 NaN 1.0 Test2
6 Test4 NaN 5.0 Test1
7 Test1 NaN NaN Test1
8 Test3 NaN NaN Test2
In [60]: df.dropna(thresh=df.shape[0]*0.6,how='all',axis=1)
Out[60]:
c1 c4
0 Test1 NaN
1 Test2 Test2
2 NaN Test1
3 Test3 Test1
4 NaN Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With