I have a data frame in which the same names occur repeatedly across rows. I want to delete the rows whose names occur only rarely. My real data frame is very large, so I am showing only a small sample here.
dataframe:
df =
name value
0 A 10
1 B 20
2 A 30
3 A 40
4 C 50
5 C 60
6 D 70
In the above data frame, B and D each occur only once, that is, fewer than 2 times. I want to delete/drop all rows whose name occurs fewer than 2 times.
My code:
##### Net strings
net_strs = df['name'].unique().tolist()
strng_list = df.group.unique().tolist()
tempdf = df.groupby('name').count()
##### strings that have less than 2 measurements in whole data set
lesstr = tempdf[tempdf['value']<2].index
##### Strings that have at least 2 measurements in whole data set
strng_list = np.setdiff1d(net_strs,lesstr).tolist()
##### Removing the strings with less measurements
df = df[df['name']==strng_list]
My present output:
ValueError: Lengths must match to compare
My expected output:
name value
0 A 10
1 A 30
2 A 40
3 C 50
4 C 60
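You could count how often each value occurs in name and then select only the rows whose name occurs more than once. Use isin rather than ==, because == compares element by element and raises "Lengths must match to compare" when the list and the column have different lengths.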
v = df.name.value_counts()
df[df.name.isin(v.index[v.gt(1)])]
Output :
name value
0 A 10
2 A 30
3 A 40
4 C 50
5 C 60
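For reference, here is a self-contained sketch of the same idea on the sample data from the question. The names `counts`, `keep` and `result` are only illustrative, and the `reset_index(drop=True)` call is an addition, assuming you want the renumbered index shown in the expected output.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["A", "B", "A", "A", "C", "C", "D"],
    "value": [10, 20, 30, 40, 50, 60, 70],
})

# Number of rows per name
counts = df["name"].value_counts()

# Names that occur at least twice
keep = counts.index[counts >= 2]

# isin tests membership row by row, so the lengths do not need to match
result = df[df["name"].isin(keep)].reset_index(drop=True)
print(result)
```

Without `reset_index`, the surviving rows keep their original labels (0, 2, 3, 4, 5), as in the output above.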
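I believe this code should give you what you want. It adds a helper column holding the number of rows for each name, keeps the rows where that number is at least 2, and then drops the helper column.
df['count'] = df.groupby('name')['name'].transform('count')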
df2 = df.loc[df['count'] >= 2].drop(columns='count')
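If you would rather not add a temporary column, the same per-group count can be used directly as a boolean mask. This is a minimal sketch on the question's sample data; the names `mask`, `df2` and `df3` are illustrative. `groupby(...).filter` is shown as an alternative that reads well, though it calls a Python function once per group and may be slower on a very large frame.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["A", "B", "A", "A", "C", "C", "D"],
    "value": [10, 20, 30, 40, 50, 60, 70],
})

# transform returns one count per original row, aligned with df's index
mask = df.groupby("name")["name"].transform("count") >= 2
df2 = df[mask]

# Alternative: keep whole groups that have at least 2 rows
df3 = df.groupby("name").filter(lambda g: len(g) >= 2)

print(df2)
```

Both variants keep the original row labels; chain `.reset_index(drop=True)` if you need them renumbered.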