
Filter out rows from one data.frame that are present in another data.frame


Suppose I have a larger data.frame and a smaller one. If the smaller one is contained inside the larger one, how can I subtract the rows of the smaller data.frame, leaving a result with the difference:

Larger - Smaller

Example:

Small data.frame:

         ID CSF1PO CSF1PO.1 D10S1248 D10S1248.1 D12S391 D12S391.1
203079_BA_M     10       11       14         16      -9        -9
203079_BA_F      8       12       14         17      -9        -9
203080_BA_M     10       12       13         13      -9        -9

Big data.frame:

         ID CSF1PO CSF1PO.1 D10S1248 D10S1248.1 D12S391 D12S391.1
203078_MG_M     -9       -9       15         15      18        20
203078_MG_F     -9       -9       14         15      17        19
203079_BA_M     10       11       14         16      -9        -9
203079_BA_F      8       12       14         17      -9        -9
203080_BA_M     10       12       13         13      -9        -9
203080_BA_F     10       11       14         16      -9        -9
203081_MG_M     10       12       14         16      -9        -9
203081_MG_F     11       12       15         16      -9        -9
203082_MG_M     11       11       13         15      -9        -9
203082_MG_F     11       11       13         14      -9        -9

The small data.frame corresponds to rows 3, 4 and 5 of the larger data.frame.

vitor asked May 01 '13 23:05




2 Answers

Try this:

BigDF[ !(BigDF$ID %in% SmallDF$ID), ] 
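As a minimal sketch of how this behaves, the snippet below rebuilds part of the question's data and applies the same filter. Only the `ID`, `CSF1PO` and `CSF1PO.1` columns are reproduced to keep it short, and the name `Result` is introduced here purely for illustration. It assumes `ID` uniquely identifies a row, which holds for the example data.

```r
# Big data.frame from the question (first two genotype columns only)
BigDF <- data.frame(
  ID = c("203078_MG_M", "203078_MG_F", "203079_BA_M", "203079_BA_F",
         "203080_BA_M", "203080_BA_F", "203081_MG_M", "203081_MG_F",
         "203082_MG_M", "203082_MG_F"),
  CSF1PO   = c(-9, -9, 10,  8, 10, 10, 10, 11, 11, 11),
  CSF1PO.1 = c(-9, -9, 11, 12, 12, 11, 12, 12, 11, 11),
  stringsAsFactors = FALSE
)

# Small data.frame: rows 3, 4 and 5 of the big one, as in the question
SmallDF <- BigDF[3:5, ]

# Keep only the rows whose ID does not appear in SmallDF
Result <- BigDF[!(BigDF$ID %in% SmallDF$ID), ]

nrow(Result)  # 7 rows remain (10 - 3)
```

Note that this matches on `ID` alone: a row is dropped whenever its `ID` appears in `SmallDF`, even if the other columns differ.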
Ferdinand.kraft answered Oct 14 '22 19:10



In dplyr:

library(dplyr)
setdiff(BigDF, SmallDF)
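One point worth keeping in mind: `setdiff()` on data frames compares whole rows and drops duplicate rows from the result, so a row is removed only if every column matches. If the intent is to match on `ID` only, `anti_join()` from the same package is a possible alternative. The sketch below uses a cut-down version of the question's data; the names `by_row` and `by_key` are made up for illustration.

```r
library(dplyr)

# Cut-down version of the question's data (ID plus one genotype column)
BigDF <- data.frame(
  ID = c("203078_MG_M", "203078_MG_F", "203079_BA_M", "203079_BA_F",
         "203080_BA_M", "203080_BA_F", "203081_MG_M", "203081_MG_F",
         "203082_MG_M", "203082_MG_F"),
  CSF1PO = c(-9, -9, 10, 8, 10, 10, 10, 11, 11, 11),
  stringsAsFactors = FALSE
)
SmallDF <- BigDF[3:5, ]

# Whole-row comparison: rows of BigDF that have no identical row in SmallDF
by_row <- setdiff(BigDF, SmallDF)

# Key-based comparison: rows of BigDF whose ID has no match in SmallDF
by_key <- anti_join(BigDF, SmallDF, by = "ID")

nrow(by_row)  # 7
nrow(by_key)  # 7
```

Here both approaches agree because the rows of `SmallDF` are exact copies of rows in `BigDF`; they would differ if a shared `ID` carried different values in the two data frames.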

More info: Hadley's dplyr cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Concise set operation functions, with examples: http://rpackages.ianhowson.com/cran/dplyr/man/setops.html (the entire Grammar of Data Manipulation documentation is a great resource overall)

Although the following does not directly answer your question, it is a closely related task that I run into often, and it has been very useful:

If you want to capture the changes that have occurred between a new data frame and a previous version of the same data frame (that is, records that are new or whose values have changed), you can write:

setdiff(NewDF, OldDF) 
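A small illustration of that idea, using made-up data (the IDs `A` to `D` and the `value` column are hypothetical, as is the name `Changed`): record `B` has a modified value and record `D` is new, so both appear in the result, while unchanged records do not.

```r
library(dplyr)

# Previous version of the data (hypothetical values)
OldDF <- data.frame(ID = c("A", "B", "C"),
                    value = c(1, 2, 3),
                    stringsAsFactors = FALSE)

# New version: B's value has changed and D has been added
NewDF <- data.frame(ID = c("A", "B", "C", "D"),
                    value = c(1, 20, 3, 4),
                    stringsAsFactors = FALSE)

# Rows of NewDF that have no identical row in OldDF
Changed <- setdiff(NewDF, OldDF)

Changed$ID  # "B" "D"
```

Records that were deleted in the new version are not reported this way; swapping the arguments, `setdiff(OldDF, NewDF)`, would list the old versions of changed rows along with any removed rows.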
leerssej answered Oct 14 '22 20:10
