I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.
To illustrate the problem, let us assume I have the following dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
                             ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])
My desired output would be:
df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
['28/02/2017', 'WalMart'],
['28/02/2017', 'WalMart']]), columns=['date', 'keyword'])
I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times the combination of two values appears in the dataset.
Could anyone provide some insight?
Use DataFrame.duplicated, specifying the columns to check for duplicates and keep=False so that every duplicated row is marked, then select those rows with boolean indexing:
df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
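For illustration, the mask that duplicated returns with keep=False marks every member of a repeated (date, keyword) pair, not just the later occurrences (a small sketch reusing df1 from the question):
mask = df1.duplicated(subset=['date','keyword'], keep=False)
print (mask.tolist())
# [True, True, False, False, True, True, False]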
If you need to specify the number of occurrences, use GroupBy.transform with 'size' to count the rows in each (date, keyword) group:
df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
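To sketch what the intermediate step looks like on df1 from the question, transform('size') returns each row's group size aligned with the original index, so comparing it to 2 gives a row-wise boolean mask:
counts = df1.groupby(['date','keyword'])['date'].transform('size')
print (counts.tolist())
# [2, 2, 1, 1, 2, 2, 1]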
For a small DataFrame, or if performance is not important, use GroupBy.filter:
df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
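If you prefer to see the counts explicitly, another way to express the same idea is to aggregate with GroupBy.size and keep only the rows whose (date, keyword) pair occurs often enough; a minimal sketch, again reusing df1 from the question:
counts = df1.groupby(['date','keyword']).size()
frequent = counts[counts >= 2].index   # MultiIndex of the (date, keyword) pairs to keep
df2 = df1[df1.set_index(['date','keyword']).index.isin(frequent)]
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart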
Alternatively, with GroupBy.apply, keep only the groups that have at least two rows:
df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()
Output
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart