How do I split a dataframe by row into chunks of n, apply a function and combine?

Tags: split, r, apply

I have a data.frame of 130,209 rows.

> head(dt)

              mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh        fc     mean
     A_00001  37.00  12.75 99.25  78.50  68.125   45.625 1.4931507  56.8750
     A_00002  31.00  21.50 84.75  53.00  57.875   37.250 1.5536913  47.5625
     A_00003  72.50  26.50 81.75  74.75  77.125   50.625 1.5234568  63.8750

I want to split the data.frame into 12 chunks, apply the scale function to the column fc within each chunk, and then combine the results. There is no grouping variable here; otherwise I'd have used ddply. Also, because 130,209 is not evenly divisible by 12, the resulting data.frames will be unbalanced, i.e., 11 data.frames will have 10,851 rows and the last one will have 10,848 rows, but that's fine.

So how do I split a data.frame by row into n chunks (in this case 12), apply a function to each chunk, and then combine them? Any help would be much appreciated.
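For reference, the chunk sizes quoted above follow from rounding the chunk size up and leaving the remainder for the last chunk. A minimal sketch of that arithmetic; the names n_rows, n_chunks, chunk_size and last_chunk are made up for illustration:

```r
n_rows   <- 130209L   # rows in the data.frame
n_chunks <- 12L       # number of chunks wanted

# Rows in each of the first n_chunks - 1 chunks (rounded up)
chunk_size <- ceiling(n_rows / n_chunks)

# Rows left over for the final, shorter chunk
last_chunk <- n_rows - (n_chunks - 1L) * chunk_size

c(chunk_size = chunk_size, last_chunk = last_chunk)
# chunk_size last_chunk
#      10851      10848
```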

Update: The two top solutions give me different results. Using @Ben Bolker's solution:

mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean         fc
  1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 -0.5231249

Using @MichaelChirico's answer:

mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean        fc  fc_scaled
  1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 0.5555556 -0.5089608

Komal Rathi asked Jul 31 '15 19:07


3 Answers

I'm not sure the structure of dt matters that much (if you are not using any of its internal values to do the splitting). Does this help?

 spl.dt <- split(dt, cut(seq_len(nrow(dt)), 12))  # 12 contiguous row chunks

 lapply(spl.dt, my_fun)  # my_fun stands for whatever function you want applied to each chunk
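
A fuller sketch of the same idea in base R, including the combine step. It assumes toy data, a smaller chunk count, and a hypothetical my_fun that scales the fc column; unsplit() is one way to put the chunks back together, not necessarily what this answer had in mind:

```r
# Toy data standing in for the question's data frame (values are made up)
set.seed(1)
dt <- data.frame(fc = rnorm(25), mean = rnorm(25))
n_chunks <- 4L  # hypothetical; the question uses 12

# Split: label each row with the contiguous row range it falls in
grp <- cut(seq_len(nrow(dt)), n_chunks, labels = FALSE)
spl.dt <- split(dt, grp)

# Apply: scale fc within one chunk; c() drops the matrix attributes scale() adds
my_fun <- function(d) {
  d$fc <- c(scale(d$fc))
  d
}
scaled <- lapply(spl.dt, my_fun)

# Combine: unsplit() reassembles the chunks in the original row order
dt2 <- unsplit(scaled, grp)
```

After this, fc has mean 0 and standard deviation 1 within each chunk, and the other columns are unchanged.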

IRTFM answered Oct 19 '22 15:10


ggplot2 has a cut_number() convenience function that will do this for you. If you don't want the overhead of loading that package, you can look at ggplot2:::breaks for the necessary logic.

Reproducible example stolen from @MichaelChirico:

set.seed(100)
KK<-130209L; nn<-12L
library("dplyr")
dt <- data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
               mLow2=rnorm(KK),mHigh2=rnorm(KK),
               meanLow=rnorm(KK),meanHigh=rnorm(KK),
               fc=rnorm(KK),mean=rnorm(KK)) %>% arrange(mean)

With apologies to those who don't like pipes:

library("ggplot2")  ## for cut_number()
dt %>% mutate(grp=cut_number(mean,12)) %>%
       group_by(grp) %>%
       mutate(fc=c(scale(fc))) %>%
       ungroup() %>%        
       select(-grp) %>%     ## drop grouping variable
       as.data.frame -> dt2 ## convert back to data frame, assign result

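It turns out that the c() around scale() is necessary: scale() returns a one-column matrix carrying "scaled:center" and "scaled:scale" attributes, and without c() to strip them the fc variable ends up with attributes that confuse tail().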
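
A small base R illustration of what c() removes here, using a made-up three-element vector:

```r
x <- c(1, 2, 3)

s <- scale(x)
is.matrix(s)            # TRUE: scale() returns a one-column matrix
names(attributes(s))    # includes "scaled:center" and "scaled:scale"

v <- c(s)               # c() flattens the matrix to a plain numeric vector
is.null(attributes(v))  # TRUE: no attributes left
v                       # -1 0 1
```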

The same logic should apply to using plyr, or base R split-apply-combine, as well (the key is using cut_number() to define the grouping variable).
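
Note that cut_number() groups rows by the values of a column, not by row position; the example above sorts by mean first, which is what makes the two line up. A sketch of that difference in base R, assuming that breaks at equally spaced quantiles are a reasonable stand-in for cut_number() (the helpers value_groups and row_groups are hypothetical, and the data is made up):

```r
set.seed(100)
dt <- data.frame(fc = rnorm(40), mean = rnorm(40))  # made-up data
n_groups <- 4L                                      # hypothetical; the answer uses 12

# Stand-in for cut_number(): breaks at equally spaced quantiles of x
value_groups <- function(x, n) {
  brks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
  as.integer(cut(x, breaks = brks, include.lowest = TRUE))
}

# Contiguous chunks by row position
row_groups <- function(k, n) cut(seq_len(k), n, labels = FALSE)

# Unsorted rows: grouping by value and grouping by position disagree
unsorted_same <- identical(value_groups(dt$mean, n_groups),
                           row_groups(nrow(dt), n_groups))

# Rows sorted by `mean`: the two groupings coincide
dt_sorted <- dt[order(dt$mean), ]
sorted_same <- identical(value_groups(dt_sorted$mean, n_groups),
                         row_groups(nrow(dt_sorted), n_groups))

c(unsorted_same = unsorted_same, sorted_same = sorted_same)
```

So if the data frame is not already ordered by the column passed to cut_number(), the groups will not be contiguous row chunks.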

Ben Bolker answered Oct 19 '22 16:10


With data.table, you can do:

library(data.table)
setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]

Here, KK is 130,209 and nn is 12. Reproducible data:

set.seed(100)
KK<-130209L; nn<-12L
dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
               mLow2=rnorm(KK),mHigh2=rnorm(KK),
               meanLow=rnorm(KK),meanHigh=rnorm(KK),
               fc=rnorm(KK),mean=rnorm(KK))

So no need to split the data and recombine.

If you'd like to add this to the data frame instead of just extracting it, you can use the := operator to assign by reference:

setDT(dt)[,fc_scaled:=scale(fc)...]
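
To see what the by= expression in the first snippet produces, here is a base R sketch on toy sizes. KK and nn are reused from the answer but with made-up small values, and ave() is swapped in as a base R way to add the scaled column; it is not the data.table call from this answer:

```r
KK <- 10L; nn <- 3L   # toy sizes; the answer uses 130209 and 12

# One chunk id per row: full chunks of ceiling(KK/nn) rows, last chunk shorter
grp <- rep(1:nn, each = ceiling(KK / nn), length.out = KK)
grp
# 1 1 1 1 2 2 2 2 3 3

set.seed(100)
dt <- data.frame(fc = rnorm(KK))   # made-up data

# ave() applies the function within each chunk and keeps the original row order
dt$fc_scaled <- ave(dt$fc, grp, FUN = function(x) c(scale(x)))
```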

MichaelChirico answered Oct 19 '22 16:10