I'm trying to create separate data.frame
objects based on levels of a factor. So if I have:
df <- data.frame(
x=rnorm(25),
y=rnorm(25),
g=rep(factor(LETTERS[1:5]), 5)
)
How can I split df
into separate data.frame
s for each level of g
containing the corresponding x
and y
values? I can get most of the way there using split(df, df$g)
, but I'd like the each level of the factor to have its own data.frame
.
What's the best way to do this?
You can also do the following: split(x = df, f = ~ var1 + var2...) This way, you can also achieve the same split dataframe by many variables without using a list in the f parameter.
The Pandas provide the feature to split Dataframe according to column index, row index, and column values, etc.
I think that split
does exactly what you want.
Notice that X is a list of data frames, as seen by str
:
X <- split(df, df$g)
str(X)
If you want individual object with the group g names you could assign the elements of X from split
to objects of those names, though this seems like extra work when you can just index the data frames from the list split
creates.
#I used lapply just to drop the third column g which is no longer needed.
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, 1:2])
#Assign the dataframes in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]
#Or use lapply with assign to assign each piece to an object all at once
lapply(seq_along(Y), function(x) {
assign(c("A", "B", "C", "D", "E")[x], Y[[x]], envir=.GlobalEnv)
}
)
Edit Or even better than using lapply
to assign to the global environment use list2env
:
names(Y) <- c("A", "B", "C", "D", "E")
list2env(Y, envir = .GlobalEnv)
A
Since dplyr 0.8.0
, we can also use group_split
which has similar behavior as base::split
library(dplyr)
df %>% group_split(g)
#[[1]]
# A tibble: 5 x 3
# x y g
# <dbl> <dbl> <fct>
#1 -1.21 -1.45 A
#2 0.506 1.10 A
#3 -0.477 -1.17 A
#4 -0.110 1.45 A
#5 0.134 -0.969 A
#[[2]]
# A tibble: 5 x 3
# x y g
# <dbl> <dbl> <fct>
#1 0.277 0.575 B
#2 -0.575 -0.476 B
#3 -0.998 -2.18 B
#4 -0.511 -1.07 B
#5 -0.491 -1.11 B
#....
It also comes with argument .keep
(which is TRUE
by default) to specify whether or not the grouped column should be kept.
df %>% group_split(g, .keep = FALSE)
#[[1]]
# A tibble: 5 x 2
# x y
# <dbl> <dbl>
#1 -1.21 -1.45
#2 0.506 1.10
#3 -0.477 -1.17
#4 -0.110 1.45
#5 0.134 -0.969
#[[2]]
# A tibble: 5 x 2
# x y
# <dbl> <dbl>
#1 0.277 0.575
#2 -0.575 -0.476
#3 -0.998 -2.18
#4 -0.511 -1.07
#5 -0.491 -1.11
#....
The difference between base::split
and dplyr::group_split
is that group_split
does not name the elements of the list based on grouping. So
df1 <- df %>% group_split(g)
names(df1) #gives
NULL
whereas
df2 <- split(df, df$g)
names(df2) #gives
#[1] "A" "B" "C" "D" "E"
data
set.seed(1234)
df <- data.frame(
x=rnorm(25),
y=rnorm(25),
g=rep(factor(LETTERS[1:5]), 5)
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With