I have data from several subjects stored in a single CSV file. After importing the CSV file, I would like to split the data from each participant into its own data.frame. More concretely, I would like to take the example data below and create three new data.frames, one for each of the 'subject_initials' values.

[Image: example data]

How do I do this? I've looked into options using the plyr package and split(), but haven't yet found a solution. I know I'm probably missing something obvious.


Splitting a data.frame by a variable [duplicate]

2 Answers

split seems to be appropriate here.

If you start with the following data frame:

df <- data.frame(ids=c(1,1,2,2,3),x=1:5,y=letters[1:5])

Then you can do:

split(df, df$ids)

And you will get a list of data frames:

R> split(df, df$ids)
$`1`
  ids x y
1   1 1 a
2   1 2 b

$`2`
  ids x y
3   2 3 c
4   2 4 d

$`3`
  ids x y
5   3 5 e
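To connect this back to the question, here is a minimal sketch of the same call applied to a column named subject_initials. Only that column name comes from the question; the data frame `dat`, the `score` column, and all values are made up for illustration.

```r
# Hypothetical stand-in for the imported CSV; only the column name
# 'subject_initials' comes from the question, the values are invented.
dat <- data.frame(
  subject_initials = c("AB", "AB", "CD", "EF", "EF"),
  score = c(10, 12, 9, 14, 11),
  stringsAsFactors = FALSE
)

# One list element (a data frame) per distinct subject
by_subject <- split(dat, dat$subject_initials)

names(by_subject)     # the subject initials, used as list names
by_subject[["CD"]]    # the data frame for a single subject

# Optionally turn each list element into its own variable:
# list2env(by_subject, envir = globalenv())
```

Keeping the pieces in a list is usually more convenient than creating separate variables, since you can loop over them with lapply.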

154

answered Oct 19 '22 13:10

juba

split is a generic. Whereas split.default is quite fast, split.data.frame gets terribly slow when the number of levels to split on increases.
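As a small illustration of the dispatch (reusing the toy data frame from the first answer; the names `by_vec` and `by_df` are just for this sketch), splitting a plain vector goes through split.default, while splitting the whole data frame goes through split.data.frame:

```r
df <- data.frame(ids = c(1, 1, 2, 2, 3), x = 1:5, y = letters[1:5])

# A plain vector: dispatches to split.default, returns a list of vectors
by_vec <- split(df$x, df$ids)

# A data frame: dispatches to split.data.frame, returns a list of data frames
by_df <- split(df, df$ids)

class(by_vec[["2"]])
class(by_df[["2"]])
```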

An alternative (faster) solution is to use data.table. I'll illustrate the difference on a larger data set here:

Sample data (what @Roland was referring to in his comment)

require(data.table)
set.seed(45)
DF <- data.frame(ids = sample(1e4, 1e6, TRUE), x = sample(letters, 1e6, TRUE), 
                  y = runif(1e6))
DT <- as.data.table(DF)

Functions + benchmarking

Note that the order of the data will be different here, as split sorts by "ids". If you want the same order, you can first do setkey(DT, ids) and then run f2.

f1 <- function() split(DF, DF$ids)
f2 <- function() {
    ans <- DT[, list(list(.SD)), by=ids]$V1
    setattr(ans, 'names', unique(DT$ids)) # sets names by reference, no copy here.
}

require(microbenchmark)
microbenchmark(ans1 <- f1(), ans2 <- f2(), times=10)

# Unit: milliseconds
#          expr        min         lq     median         uq       max neval
#  ans1 <- f1() 37015.9795 43994.6629 48132.3364 49086.0926 63829.592    10
#  ans2 <- f2()   332.6094   361.1902   409.2191   528.0674  1005.457    10

split.data.frame took a median of about 48 seconds, whereas data.table took about 0.41 seconds.
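As a side note that is not part of the benchmark above: as far as I know, more recent data.table releases also provide a split method for data.table objects that takes a `by` argument. The sketch below assumes your installed version has it, and uses a small made-up table rather than the benchmark data.

```r
library(data.table)

# Small made-up table, same columns as the first answer's example
DT <- data.table(ids = c(3, 1, 2, 1, 3), x = 1:5, y = letters[1:5])

# sorted = TRUE orders the pieces by the grouping value, like base split
parts <- split(DT, by = "ids", sorted = TRUE)

names(parts)
parts[["1"]]
```

I have not timed this against f2, so treat it as a convenience rather than a performance claim.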

answered Oct 19 '22 12:10

Arun


Tags:

dataframe

r

CaptainProg
