
Drop rows in pandas if records in two columns do not appear together at least twice in the dataset

I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.

To illustrate the problem, let us assume I have the following dataframe:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]), columns=['date', 'keyword'])

My desired output would be:

df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart']]), columns=['date', 'keyword'])

I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times a combination of two values appears in the dataset.

Could anyone provide some insight?

arctic.queenolina asked Jul 08 '19



2 Answers

Use DataFrame.duplicated, specifying the columns to check for duplicates and keep=False so that all duplicated rows are marked, then select them with boolean indexing:

df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart

If you need to specify the minimum number of occurrences, use GroupBy.transform with 'size' to count the rows in each group:

df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
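Putting the pieces together, the transform approach can be run end to end like this (a minimal sketch using the sample data from the question; raise the threshold to require more occurrences):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]), columns=['date', 'keyword'])

# For every row, count how often its (date, keyword) pair occurs in the whole frame;
# transform('size') broadcasts the group size back onto the original index.
counts = df1.groupby(['date', 'keyword'])['date'].transform('size')

# Keep only rows whose pair occurs at least twice
df2 = df1[counts >= 2]
```

Only the two Apple rows and the two WalMart rows on 28/02/2017 survive; the three pairs that occur once are dropped.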

For a small DataFrame, or when performance is not important, use GroupBy.filter:

df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
jezrael answered Oct 25 '22


Alternatively, with GroupBy.apply, keep each group only if it has at least two rows:

df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()

Output

         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
iamklaus answered Oct 25 '22