Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subset rows in a data frame based on a vector of values

Tags:

r

r-faq

subset

I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a report. (Don't worry, this data isn't being permanently deleted!)

I have read the following:

  • Selecting columns in R data frame based on those *not* in a vector
  • http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
  • How to combine multiple conditions to subset a data-frame using "OR"?

But I'm still not able to get this to work right. Here's my code:

bg2011missingFromBeg <- setdiff(x=eg2011$ID, y=bg2011$ID) #attempt 1 eg2011cleaned <- subset(eg2011, ID != bg2011missingFromBeg) #attempt 2 eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg] 

The first try just eliminates the first value in the resulting setdiff vector. The second try yields and unwieldy error:

Error in `[.data.frame`(eg2012, !eg2012$ID %in% bg2012missingFromBeg)  :  undefined columns selected 
like image 571
Zelbinian Avatar asked Mar 05 '13 15:03

Zelbinian


People also ask

How do you subset data based on a vector in R?

The way you tell R that you want to select some particular elements (i.e., a 'subset') from a vector is by placing an 'index vector' in square brackets immediately following the name of the vector. For a simple example, try x[1:10] to view the first ten elements of x.

What function is used to subset a data frame according to the values of a variable in the data frame?

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [ .


2 Answers

This will give you what you want:

eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ] 

The error in your second attempt is because you forgot the ,

In general, for convenience, the specification object[index] subsets columns for a 2d object. If you want to subset rows and keep all columns you have to use the specification object[index_rows, index_columns], while index_cols can be left blank, which will use all columns by default.

However, you still need to include the , to indicate that you want to get a subset of rows instead of a subset of columns.

like image 182
adibender Avatar answered Sep 19 '22 11:09

adibender


If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:

data_A[match(data_B$index, data_A$index, nomatch=0),] data_B[match(data_A$index, data_B$index, nomatch=0),] 

This is, though, the same as:

data_A[data_A$index %in% data_B$index,] data_B[data_B$index %in% data_A$index,] 

Here is a demo:

# Set seed for reproducibility. set.seed(1)  # Create two sample data sets. data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90)) data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))  # Subset data of each data frame by the index in the other. t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),] t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]  # Make sure they match. data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]  #    index     value index.1    value.1 # 27     3 0.7155661       3 0.65887761 # 10    12 0.6049333      12 0.14362694 # 88    14 0.7410786      14 0.42021589 # 56    15 0.4525708      15 0.78101754 # 38    18 0.2075451      18 0.70277874 # 24    23 0.4314737      23 0.78218212 # 34    32 0.1734423      32 0.85508236 # 22    38 0.7317925      38 0.56426384 # 84    39 0.3913593      39 0.09485786 # 5     40 0.7789147      40 0.31248966 # 74    43 0.7799849      43 0.10910096 # 71    45 0.2847905      45 0.26787813 # 57    46 0.1751268      46 0.17719454 # 25    48 0.1482116      48 0.99607737 # 81    53 0.6304141      53 0.26721208 # 60    58 0.8645449      58 0.96920881 # 30    59 0.6401010      59 0.67371223 # 75    61 0.8806190      61 0.69882454 # 63    64 0.3287773      64 0.36918946 # 19    70 0.9240745      70 0.11350771 
like image 45
Dinre Avatar answered Sep 20 '22 11:09

Dinre