I am trying to subset a data frame, where I get multiple data frames based on multiple column values. Here is my example
    > df
    v1 v2 v3 v4 v5
    A  Z   1 10 12
    D  Y  10 12  8
    E  X   2 12 15
    A  Z   1 10 12
    E  X   2 14 16
The expected output is something like this, where the data frame is split into multiple data frames based on columns v1 and v2:
    > df1
    v3 v4 v5
     1 10 12
     1 10 12

    > df2
    v3 v4 v5
    10 12  8

    > df3
    v3 v4 v5
     2 12 15
     2 14 16
I have written code that works, but I don't think it is the best way to do this; there must be a better one. Assuming tab is the data.frame holding the initial data, here is my code:
    v1Factors <- levels(factor(tab$v1))
    v2Factors <- levels(factor(tab$v2))
    for (i in 1:length(v1Factors)) {
      for (j in 1:length(v2Factors)) {
        subsetTab <- subset(tab, v1 == v1Factors[i] & v2 == v2Factors[j],
                            select = c("v3", "v4", "v5"))
        print(subsetTab)
      }
    }
Can someone suggest a better method to do the above?
Method 1: Using the plyr package. The rbind.fill() function is an enhancement of base R's rbind() and is used to combine data frames with different columns. The column names and their number may differ between the input data frames. Columns missing from a given data frame are filled with NA.
Use the split() function in R to split a vector or data frame. Use the unsplit() method to retrieve the split vector or data frame.
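As a minimal sketch of that round trip, the snippet below rebuilds the question's data as `df`, splits it on the single column `v1`, and puts it back together with `unsplit()`. The variable names `pieces` and `restored` are illustrative only.

```r
# Example data copied from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1, 10, 2, 1, 2),
  v4 = c(10, 12, 12, 10, 14),
  v5 = c(12, 8, 15, 12, 16)
)

# Split on one column: one data frame per distinct value of v1
pieces <- split(df, df$v1)
names(pieces)

# unsplit() reassembles the pieces in the original row order,
# given the same grouping vector that was used for the split
restored <- unsplit(pieces, df$v1)
restored
```

Note that `unsplit()` needs the original grouping vector, so it is worth keeping that around if the pieces will be recombined later.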
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frame A has variables that data frame B does not, one option is to delete the extra variables in data frame A.
You can also do the following:

    split(x = df, f = ~ var1 + var2...)

This way, you can split the data frame by several variables without passing a list as the f parameter.
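A short sketch of the formula form applied to the question's data follows. It assumes R 4.1.0 or later, where `split()` on a data frame accepts a formula for `f`; the names `by_formula` and `by_list` are made up for the comparison.

```r
# Example data copied from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1, 10, 2, 1, 2),
  v4 = c(10, 12, 12, 10, 14),
  v5 = c(12, 8, 15, 12, 16)
)

# Formula interface (assumes R >= 4.1.0); drop = TRUE discards
# v1/v2 combinations that never occur in the data
by_formula <- split(df, ~ v1 + v2, drop = TRUE)

# The list form of the same split, for comparison
by_list <- split(df, list(df$v1, df$v2), drop = TRUE)

names(by_formula)
names(by_list)
```

On older R versions the formula call is expected to fail, in which case the list form is the one to use.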
You are looking for split
    split(df, with(df, interaction(v1,v2)), drop = TRUE)
    $E.X
      v1 v2 v3 v4 v5
    3  E  X  2 12 15
    5  E  X  2 14 16

    $D.Y
      v1 v2 v3 v4 v5
    2  D  Y 10 12  8

    $A.Z
      v1 v2 v3 v4 v5
    1  A  Z  1 10 12
    4  A  Z  1 10 12
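The split above keeps v1 and v2 in every piece, whereas the expected output in the question shows only v3, v4 and v5. One way to get that, sketched here with the illustrative names `value_cols` and `groups`, is to split just the value columns while still grouping on v1 and v2:

```r
# Example data copied from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1, 10, 2, 1, 2),
  v4 = c(10, 12, 12, 10, 14),
  v5 = c(12, 8, 15, 12, 16)
)

# Columns to keep in each piece
value_cols <- c("v3", "v4", "v5")

# Split only the value columns; the grouping still uses v1 and v2
groups <- split(df[value_cols], list(df$v1, df$v2), drop = TRUE)

# Each element is a data frame with just v3, v4 and v5
groups[["A.Z"]]
```

The result is a named list, so individual pieces can be pulled out by name (for example `groups[["E.X"]]`) rather than being stored as separate df1, df2, df3 objects.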
As noted in the comments, any of the following would work:
    library(microbenchmark)
    microbenchmark(
      split(df, list(df$v1,df$v2), drop = TRUE),
      split(df, interaction(df$v1,df$v2), drop = TRUE),
      split(df, with(df, interaction(v1,v2)), drop = TRUE))

    Unit: microseconds
                                                      expr      min        lq    median       uq      max neval
                split(df, list(df$v1, df$v2), drop = TRUE) 1119.845 1129.3750 1145.8815 1182.119 3910.249   100
         split(df, interaction(df$v1, df$v2), drop = TRUE)  893.749  900.5720  909.8035  936.414 3617.038   100
     split(df, with(df, interaction(v1, v2)), drop = TRUE)  895.150  902.5705  909.8505  927.128 1399.284   100
It appears interaction is slightly faster (probably due to the fact that f = list(...) is just converted to an interaction within the function).
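To check that the three calls really are interchangeable apart from speed, a quick comparison along these lines can help. This is only a sketch on the question's data, and `by_list`, `by_inter` and `by_with` are illustrative names.

```r
# Example data copied from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1, 10, 2, 1, 2),
  v4 = c(10, 12, 12, 10, 14),
  v5 = c(12, 8, 15, 12, 16)
)

# The three variants that were benchmarked
by_list  <- split(df, list(df$v1, df$v2), drop = TRUE)
by_inter <- split(df, interaction(df$v1, df$v2), drop = TRUE)
by_with  <- split(df, with(df, interaction(v1, v2)), drop = TRUE)

# Compare group names and contents across the variants
setequal(names(by_list), names(by_inter))
all.equal(by_list[names(by_inter)], by_inter)
all.equal(by_inter, by_with)
```

With data this small the timing difference is unlikely to matter, so the choice mostly comes down to which form reads best.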
Edit
If you just want to use the subset data.frames, then I would suggest using data.table for ease of coding:
    library(data.table)
    dt <- data.table(df)
    dt[, plot(v4, v5), by = list(v1, v2)]
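For comparison, the same "do something per group" idea can be sketched in base R by applying a function over the result of `split()`. This is not the data.table approach above but a swapped-in base R equivalent; the mean of v5 is an arbitrary stand-in for whatever per-group work is needed, and `group_means` is an illustrative name.

```r
# Example data copied from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1, 10, 2, 1, 2),
  v4 = c(10, 12, 12, 10, 14),
  v5 = c(12, 8, 15, 12, 16)
)

# One data frame per observed (v1, v2) combination
groups <- split(df, list(df$v1, df$v2), drop = TRUE)

# Apply a function to every piece; here, the mean of v5 per group
group_means <- sapply(groups, function(g) mean(g$v5))
group_means
```

Replacing the anonymous function with a plotting call would give one plot per group, similar in spirit to the data.table line.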