
Drop rows in pandas if records in two columns do not appear together at least twice in the dataset

I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.

To illustrate the problem, let us assume I have the following dataframe:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]), columns=['date', 'keyword'])

My desired output would be:

df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart']]), columns=['date', 'keyword'])

I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times a combination of two values appears in the dataset.

Could anyone provide some insight?

arctic.queenolina asked Jul 08 '19



2 Answers

Use DataFrame.duplicated, specifying the columns to check for duplicates and keep=False so that all duplicated rows are marked, then select them with boolean indexing:

df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart

If you need to specify the minimum number of occurrences, use GroupBy.transform with 'size' to count the rows in each group:

df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
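Putting the pieces together, the transform approach can be run end to end like this (a minimal sketch using the sample data from the question; raise the threshold to require more occurrences):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]), columns=['date', 'keyword'])

# For every row, count how often its (date, keyword) pair occurs in the whole frame;
# transform('size') broadcasts the group size back onto the original index.
counts = df1.groupby(['date', 'keyword'])['date'].transform('size')

# Keep only rows whose pair occurs at least twice
df2 = df1[counts >= 2]
```

Only the two Apple rows and the two WalMart rows on 28/02/2017 survive; the three pairs that occur once are dropped.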

For a small DataFrame, or when performance is not important, use GroupBy.filter:

df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
jezrael answered Oct 25 '22


Alternatively, with GroupBy.apply, keep each group only if it has at least two rows:

df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()

Output

         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
iamklaus answered Oct 25 '22