 

the rules of subsetting

Having df1 and df2 as follows:

df1 <- read.table(text =" x y z
                          1 1 1
                          1 2 1
                          1 1 2
                          2 1 1
                          2 2 2",header=TRUE)

df2 <- read.table(text =" a b c
                          1 1 1
                          1 2 8
                          1 1 2
                          2 6 2",header=TRUE)

I can ask of the data a bunch of things like:

 df2[ df2$b == 6 | df2$c == 8 ,] # rows in df2 where b == 6 or c == 8
 # and additive conditions
 df2[ df2$b == 6 & df2$c == 8 ,] # zero rows

Between data frames:

 df1[ df1$z %in% df2$c ,] # rows in df1 where values in z are in c (all rows)

This gives me all rows:

 df1[ (df1$x %in%  df2$a) &
      (df1$y %in%  df2$b) &
      (df1$z %in%  df2$c) ,]

But shouldn't this give me all rows of df1 too?

 df1[ df1$z %in% df2$c | df1$b == 9,]

What I am really hoping to do is subset df1 on df2 using three column conditions, so that I only get rows in df1 where x, y, z all equal a, b, c at the same time within a row. In real data I will have more than 3 columns, but I will still want to subset on 3 additive column conditions.

So subsetting my example data df1 on df2 my result would be:

df1
   1 1 1
   1 1 2

Playing with the syntax has confused me more, and the SO posts are all variations of what I want that actually lead to more confusion for me.

I figured out I can do this:

 merge(df1,df2, by.x=c("x","y","z"),by.y=c("a","b","c"))

which gives me what I want, but I would like to understand why I am wrong in my [ attempts.

user1320502 asked Jan 30 '13 11:01


1 Answer

In addition to your nice solution using merge (thanks for that, I always forget merge), this can be achieved in base using ?interaction as follows. There may be other variations of this, but this is the one I am familiar with:

> df1[interaction(df1) %in% interaction(df2), ]
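To see why this works where the three separate %in% tests do not, here is a minimal sketch using the example data from the question. The names colwise, key1, key2, rowwise and matched are introduced only for illustration. Each %in% test asks whether a value occurs anywhere in the other column, regardless of row, whereas interaction() builds one combined key per row, so the comparison is made on whole rows.

```r
df1 <- read.table(text = "x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)

df2 <- read.table(text = "a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)

# Column-wise tests: each one only asks "does this value occur
# anywhere in that column of df2?", independently of the row.
colwise <- (df1$x %in% df2$a) & (df1$y %in% df2$b) & (df1$z %in% df2$c)

# Row-wise keys: one label per row, combining all three columns.
key1 <- as.character(interaction(df1))
key2 <- as.character(interaction(df2))

# Membership is now tested on whole-row keys.
rowwise <- key1 %in% key2
matched <- df1[rowwise, ]

print(colwise)
print(key1)
print(matched)
```

The keys are labels joined with a separator, so as a general caution, values that themselves contain the separator could in principle produce ambiguous keys. With plain integers like these that is not a concern.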

Now to answer your question: First, I think there's a typo (corrected) in:

df1[ df1$z %in% df2$c | df2$b == 9,] # second part should be df2$b == 9

You would get a warning, because the first part evaluates to

[1] TRUE TRUE TRUE TRUE TRUE

and the second evaluates to:

[1] FALSE FALSE FALSE FALSE

You do a | operation on vectors of unequal length (5 and 4). R recycles the shorter vector and still returns a result, but warns:

longer object length is not a multiple of shorter object length
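The length mismatch can be reproduced with a short sketch. The two logical vectors are written out literally from the values shown above, and the names z_in_c, b_is_9, msg, res, no_col and combined are introduced only for illustration. The last two lines look at the expression as originally typed, with df1$b: df1 has no column b, so df1$b is NULL and the comparison has length zero.

```r
z_in_c <- c(TRUE, TRUE, TRUE, TRUE, TRUE)  # df1$z %in% df2$c, length 5
b_is_9 <- c(FALSE, FALSE, FALSE, FALSE)    # df2$b == 9, length 4

# Capture the condition message instead of letting it print.
msg <- tryCatch(z_in_c | b_is_9,
                warning = function(w) conditionMessage(w))

# The shorter vector is recycled, so a length-5 result still comes back.
res <- suppressWarnings(z_in_c | b_is_9)

# As originally typed (df1$b): a missing column is NULL,
# and NULL == 9 is a zero-length logical vector.
no_col <- NULL == 9
combined <- z_in_c | no_col  # zero-length as well, so it selects no rows

print(msg)
print(res)
print(length(combined))
```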

Edit: If you have multiple columns then you can choose the interaction as such. For example, if you want to get from df1 the rows where the first two columns match with that of df2, then you could simply do:

> df1[interaction(df1[, 1:2]) %in% interaction(df2[, 1:2]), ]
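With more columns in real data, selecting the key columns by name may be easier to read than selecting by position. The following is a sketch on the example data, assuming the columns to compare are x, y in df1 and a, b in df2. The names cols1, cols2, keep and res are hypothetical.

```r
df1 <- read.table(text = "x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)

df2 <- read.table(text = "a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)

# Columns to compare, listed in matching order (assumed for this example).
cols1 <- c("x", "y")
cols2 <- c("a", "b")

# Build per-row keys from the chosen columns only, then test membership.
keep <- interaction(df1[cols1]) %in% interaction(df2[cols2])
res <- df1[keep, ]

print(res)
```

The order of the names in cols1 and cols2 matters, because the keys are built column by column in the order given.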
Arun answered Sep 22 '22 15:09
