 

pandas - filter dataframe by another dataframe by row elements

I have a dataframe df1 which looks like:

   c  k  l
0  A  1  a
1  A  2  b
2  B  2  a
3  C  2  a
4  C  2  d

and another called df2 like:

   c  l
0  A  b
1  C  a

I would like to filter df1, keeping only the rows whose (c, l) values ARE NOT in df2. The values to filter out are the (A, b) and (C, a) tuples. So far I have tried the isin method:

d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]

That seems too complicated to me, and it returns:

   c  k  l
2  B  2  a
4  C  2  d

but I'm expecting:

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
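If I check the column-wise masks (a quick sketch reusing the frames above), I can see why row 0 gets dropped: 'A' appears somewhere in df2['c'] and 'a' appears somewhere in df2['l'], even though the pair (A, a) is not a row of df2.

import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# each mask is computed per column, so the pairing between c and l is lost
mask = df1['l'].isin(df2['l']) & df1['c'].isin(df2['c'])
print(mask.tolist())  # [True, True, False, True, False] -> row 0 is (wrongly) marked for removal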
asked Oct 22 '15 by Fabio Lamanna




2 Answers

You can do this efficiently using isin on a multiindex constructed from the desired columns:

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]

which returns:

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
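As a quick illustration of that point, here is a minimal sketch (with made-up numeric data, not from the question) showing the same MultiIndex/isin pattern applied to numeric key columns:

import pandas as pd

# hypothetical frames whose key columns are numbers rather than strings
df_a = pd.DataFrame({'x': [1, 1, 2, 3], 'y': [10, 20, 10, 30], 'val': ['p', 'q', 'r', 's']})
df_b = pd.DataFrame({'x': [1, 3], 'y': [20, 30]})

keys = ['x', 'y']
ia = df_a.set_index(keys).index
ib = df_b.set_index(keys).index
df_a[~ia.isin(ib)]  # keeps only the rows whose (x, y) pair is not in df_b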


(The answer above is an edit; what follows was my initial answer.)

Interesting! This is something I haven't come across before... I would probably solve it by merging the two dataframes, then dropping the rows where df2 is defined. Here is an example, which makes use of a temporary marker column:

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# create a column marking df2 values
df2['marker'] = 1

# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined

which gives:

   c  k  l  marker
0  A  1  a     NaN
1  A  2  b     1.0
2  B  2  a     NaN
3  C  2  a     1.0
4  C  2  d     NaN

# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]

which returns the desired rows:

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

There may be a way to do this without the temporary marker column, but I can't think of one. As long as your data isn't huge, the above method should be fast and sufficient.
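For what it's worth, newer pandas versions let you skip the temporary marker column via merge's indicator argument; a minimal sketch under that assumption, reusing the same df1 and df2:

import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# indicator=True adds a '_merge' column saying which frame each row came from
joined = df1.merge(df2, on=['c', 'l'], how='left', indicator=True)

# 'left_only' rows are those with no matching (c, l) pair in df2
joined[joined['_merge'] == 'left_only'].drop(columns='_merge')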

answered Sep 26 '22 by jakevdp


This is pretty succinct and works well:

df1 = df1[~df1.index.isin(df2.index)] 
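Note that this compares the frames' indexes rather than their column values, so for the (c, l)-pair filtering asked about above it only gives the expected result if both frames are first indexed by those key columns; a small sketch of that assumption:

import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# index both frames by the key columns so "same index" means "same (c, l) pair"
a = df1.set_index(['c', 'l'])
b = df2.set_index(['c', 'l'])
a[~a.index.isin(b.index)].reset_index()[df1.columns]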
answered Sep 26 '22 by Haroon Hassan