Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting a data.frame by a variable [duplicate]

Tags:

dataframe

r

I have data from several subjects stored in a single CSV file. After importing the CSV file, I would like to split the data from each participant off into its own data.frame.

More literally, I would like to take the example data below, and create three new data.frames; one for each of the 'subject_initials' values.

enter image description here

How do I do this? I've thus far looked into options using the plyr package and split(), but haven't yet found a solution. I know I'm probably missing something obvious.

like image 415
CaptainProg Avatar asked Sep 20 '13 09:09

CaptainProg


People also ask

How do you split a DataFrame into multiple data frames?

Here, we use the DataFrame. groupby() method for splitting the dataset by rows. The same grouped rows are taken as a single element and stored in a list. This list is the required output which consists of small DataFrames.

How do you separate data frames?

Use the split() function in R to split a vector or data frame. Use the unsplit() method to retrieve the split vector or data frame.

How do you separate duplicates in R?

Remove duplicate rows in a data frameThe function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It's an efficient version of the R base function unique() .


2 Answers

split seems to be appropriate here.

If you start with the following data frame :

df <- data.frame(ids=c(1,1,2,2,3),x=1:5,y=letters[1:5])

Then you can do :

split(df, df$ids)

And you will get a list of data frames :

R> split(df, df$ids)
$`1`
  ids x y
1   1 1 a
2   1 2 b

$`2`
  ids x y
3   2 3 c
4   2 4 d

$`3`
  ids x y
5   3 5 e
like image 154
juba Avatar answered Oct 19 '22 13:10

juba


split is a generic. Whereas split.default is quite fast, split.data.frame gets terribly slow when the number of levels to split on increases.

The alternate (faster) solution would be to use data.table. I'll illustrate the difference on a bigger data here:

Sample data (what @Roland was referring to in his comment)

require(data.table)
set.seed(45)
DF <- data.frame(ids = sample(1e4, 1e6, TRUE), x = sample(letters, 1e6, TRUE), 
                  y = runif(1e6))
DT <- as.data.table(DF)

Functions + benchmarking

Note that the order of the data will be different here as split sorts by "ids". IF you want that you can first do setkey(DT, ids) and then run f2.

f1 <- function() split(DF, DF$ids)
f2 <- function() {
    ans <- DT[, list(list(.SD)), by=ids]$V1
    setattr(ans, 'names', unique(DT$ids)) # sets names by reference, no copy here.
}

require(microbenchmark)
microbenchmark(ans1 <- f1(), ans2 <- f2(), times=10)

# Unit: milliseconds
#          expr        min         lq     median         uq       max neval
#  ans1 <- f1() 37015.9795 43994.6629 48132.3364 49086.0926 63829.592    10
#  ans2 <- f2()   332.6094   361.1902   409.2191   528.0674  1005.457    10

split.data.frame took an average of 48 seconds wheres data.table took 0.41 seconds

like image 30
Arun Avatar answered Oct 19 '22 12:10

Arun