How do I split a data frame based on range of column values in R?

Tags: split, r, subset

I have a data set like this:

Users   Age
1        2
2        7
3        10
4        3
5        8
6        20

How do I split this data set into 3 data sets where the first consists of all users with ages between 0–5, second is 6–10 and third is 11–15?

user136482 asked Jul 11 '14 23:07


People also ask

How do I subdivide a DataFrame in R?

Use the split() function in R to divide a vector or data frame into groups. Use the unsplit() function to reassemble the original vector or data frame from the pieces.

How do I split data in columns in R?

To split a column into multiple columns in R, use the separate() function from the tidyr package. The separate() function splits a character column into multiple columns using a regular expression or numeric positions.

How do you split data into subsets in R?

To split the data frame in R, use the split() function. You can split a data set into subsets based on one or more variables representing groups of the data.

How do you split a data frame in two variables?

You can also pass a formula: split(x = df, f = ~ var1 + var2). In recent versions of R, this splits the data frame by several variables without wrapping them in a list for the f parameter.


3 Answers

You can combine split with cut to do this in a single line of code, avoiding the need to subset with a bunch of different expressions for different data ranges:

# dat is the data frame from the question (columns Users and Age)
split(dat, cut(dat$Age, c(0, 5, 10, 15), include.lowest=TRUE))
# $`[0,5]`
#   Users Age
# 1     1   2
# 4     4   3
# 
# $`(5,10]`
#   Users Age
# 2     2   7
# 3     3  10
# 5     5   8
# 
# $`(10,15]`
# [1] Users Age  
# <0 rows> (or 0-length row.names)

cut assigns each value to an interval defined by the specified break points, and split divides a data frame by the resulting categories. The user with age 20 falls outside every interval, so cut returns NA for that row and split drops it. If you stored the result of this computation in a list called l, you could access the smaller data frames with l[[1]], l[[2]], and l[[3]] or the more verbose:

l$`[0,5]`
l$`(5,10]`
l$`(10,15]`
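
To see why the user aged 20 disappears from the result, it can help to look at what cut returns on its own. The sketch below assumes the question's data is stored in dat; the open-ended final break (Inf) and the labels "0-5", "6-10", "11+" are illustrative choices that assume whole-number ages, not part of the original answer.

```r
# Assumed reconstruction of the data frame from the question
dat <- data.frame(Users = 1:6, Age = c(2, 7, 10, 3, 8, 20))

# cut() maps each age to an interval; values outside all breaks become NA
groups <- cut(dat$Age, breaks = c(0, 5, 10, 15), include.lowest = TRUE)
print(groups)

# split() drops rows whose group is NA (here, the user aged 20)
l <- split(dat, groups)
sapply(l, nrow)

# Illustrative variation: an open-ended last break keeps every age above 10
l_open <- split(dat, cut(dat$Age, breaks = c(0, 5, 10, Inf),
                         labels = c("0-5", "6-10", "11+"),
                         include.lowest = TRUE))
l_open[["11+"]]
```

Note that the list names produced by cut contain no spaces, so they must be typed exactly (for example l[["(10,15]"]]) when indexing by name.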
josliber answered Oct 05 '22 15:10


First, here's your dataset for my purposes: foo=data.frame(Users=1:6,Age=c(2,7,10,3,8,20))

Here's your first dataset with ages 0–5: subset(foo,Age<=5&Age>=0)

  Users Age
1     1   2
4     4   3

Here's your second with ages 6–10: subset(foo,Age<=10&Age>=6)

  Users Age
2     2   7
3     3  10
5     5   8

Your third (using subset(foo,Age<=15&Age>=11)) is empty – your last Age observation is over 15.

Note also that fractional ages between 5 and 6 or between 10 and 11 (e.g., 5.1, 10.5) would be excluded from every group, as this code matches your question very literally. If you want anyone with an age less than 6 to go in the first group, just amend that code to subset(foo,Age<6&Age>=0). If you'd prefer a hypothetical person with Age=5.1 in the second group, that group's code would be subset(foo,Age<=10&Age>5).
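
The boundary choices above can also be expressed with cut. As a sketch, assuming fractional ages should fall into the group of their whole-number part (so 5.1 joins the first group), left-closed intervals via right = FALSE cover the gaps. The sample ages below are hypothetical, chosen only to exercise the boundaries.

```r
# Hypothetical data with fractional ages to exercise the boundaries
foo <- data.frame(Users = 1:5, Age = c(5, 5.1, 6, 10.5, 11))

# right = FALSE gives left-closed intervals: [0,6), [6,11), [11,16)
bands <- cut(foo$Age, breaks = c(0, 6, 11, 16), right = FALSE)
groups <- split(foo, bands)

names(groups)
groups[["[0,6)"]]
```

With these breaks, every age from 0 up to (but not including) 16 lands in exactly one group, so no subset conditions need to be kept in sync by hand.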

Nick Stauner answered Oct 05 '22 14:10


We could also use the between function from the data.table package.

# Load data.table, which provides setDT() and between()
library(data.table)

# Create a data frame
dat <- data.frame(Users = 1:7, Age = c(2, 7, 10, 3, 8, 12, 15))

# Convert the data frame to data table by reference
# (data.table is also a data.frame)
setDT(dat)

# Define a list with the cut pairs
cuts <- list(c(0, 5), c(6, 10), c(11, 15))

# Loop over the cut pairs and subset dat to the rows whose Age falls within
# each pair (between() is inclusive at both ends by default)
lapply(X = cuts, FUN = function(i) {
  dat[between(Age, lower = i[1], upper = i[2])]
})

Output:

[[1]]
   Users Age
1:     1   2
2:     4   3

[[2]]
   Users Age
1:     2   7
2:     3  10
3:     5   8

[[3]]
   Users Age
1:     6  12
2:     7  15

Many other things are possible, including doing it by group; data.table is rather flexible.
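
For readers who do not have data.table installed, the same loop over cut pairs can be sketched in base R. The list names ("0-5" and so on) are illustrative additions, and the comparison is inclusive at both ends, matching the default behavior of between.

```r
# Same sample data as above, kept as a plain data frame
dat <- data.frame(Users = 1:7, Age = c(2, 7, 10, 3, 8, 12, 15))

# Named list of (lower, upper) pairs; the names are illustrative
cuts <- list("0-5" = c(0, 5), "6-10" = c(6, 10), "11-15" = c(11, 15))

# lapply keeps the names of cuts, so each subset can be looked up by label
parts <- lapply(cuts, function(i) {
  dat[dat$Age >= i[1] & dat$Age <= i[2], ]
})

parts[["11-15"]]
```

Naming the pairs means a subset can be retrieved as parts[["6-10"]] instead of by position.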

panman answered Oct 05 '22 15:10