
How do I split a data frame by row into chunks of n, apply a function, and combine?

I have a data.frame of 130,209 rows.

> head(dt)

              mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh        fc     mean
     A_00001  37.00  12.75 99.25  78.50  68.125   45.625 1.4931507  56.8750
     A_00002  31.00  21.50 84.75  53.00  57.875   37.250 1.5536913  47.5625
     A_00003  72.50  26.50 81.75  74.75  77.125   50.625 1.5234568  63.8750

I want to split the data.frame into 12 chunks, apply the scale function to the column fc within each chunk, and then combine the results. There is no grouping variable here; otherwise I'd have used ddply. Also, because 130,209 is not evenly divisible by 12, the resulting data.frames will be unbalanced, i.e., 11 data.frames will have 10,851 rows and the last one will have 10,848 rows, but that's fine.

So how do I split a data.frame by row into n chunks (in this case 12), apply a function to each chunk, and then combine them again? Any help would be much appreciated.

Update: Using the two top solutions, I get different results. Using @Ben Bolker's solution:

mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean         fc
  1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 -0.5231249

Using @MichaelChirico's answer:

mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean        fc  fc_scaled
  1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 0.5555556 -0.5089608


I'm not sure the structure of dt matters that much (if you are not using any of its internal values to do the splitting). Does this help?

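 ## label each row with one of 12 contiguous, roughly equal-sized chunks, then split
 spl.dt <- split( dt , cut(seq_len(nrow(dt)), 12) )

 ## my_fun stands for whatever function you want to apply to each chunk
 lapply( spl.dt, my_fun)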
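
To cover the combine step as well, here is a minimal base R sketch on a small made-up data frame. The names `toy`, `chunk`, `pieces`, `scale_fc` and `res`, and the chunk count of 3, are illustrative assumptions rather than anything from the question.

```r
## Toy data (hypothetical): 10 rows, to be split into 3 contiguous chunks
set.seed(1)
toy <- data.frame(id = 1:10, fc = rnorm(10))

## Split: cut() on the row index gives one integer chunk label per row
chunk  <- cut(seq_len(nrow(toy)), 3, labels = FALSE)
pieces <- split(toy, chunk)

## Apply: standardise fc within a chunk; c() drops scale()'s matrix attributes
scale_fc <- function(d) {
  d$fc <- c(scale(d$fc))
  d
}
pieces <- lapply(pieces, scale_fc)

## Combine: unsplit() reassembles the rows in their original order
res <- unsplit(pieces, chunk)
```

Using `unsplit()` with the same labels that were passed to `split()` keeps the original row order even when the chunks are not contiguous.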

ggplot2 has a cut_number() convenience function that will do this for you. If you don't want the overhead of loading that package, you can look at ggplot2:::breaks for the necessary logic.

Reproducible example stolen from @MichaelChirico:

set.seed(100)
KK<-130209L; nn<-12L
library("dplyr")
dt <- data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
               mLow2=rnorm(KK),mHigh2=rnorm(KK),
               meanLow=rnorm(KK),meanHigh=rnorm(KK),
               fc=rnorm(KK),mean=rnorm(KK)) %>% arrange(mean)

With apologies to those who don't like pipes:

library("ggplot2")  ## for cut_number()
dt %>% mutate(grp=cut_number(mean,12)) %>%
       group_by(grp) %>%
       mutate(fc=c(scale(fc))) %>%
       ungroup() %>%        
       select(-grp) %>%     ## drop grouping variable
       as.data.frame -> dt2 ## convert back to data frame, assign result

It turns out that the c() around scale() is necessary: scale() returns a one-column matrix with scaled:center and scaled:scale attributes, and without c() the fc variable keeps them, which confuses tail() ...
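
A quick way to see what `c()` changes, using a hypothetical three-element vector (`x`, `s` and `v` are names made up for this sketch):

```r
x <- c(1, 2, 3)          # hypothetical input
s <- scale(x)

is.matrix(s)             # scale() returns a 3 x 1 matrix
names(attributes(s))     # dim plus scaled:center and scaled:scale

v <- c(s)                # c() flattens the matrix and drops those attributes
is.null(attributes(v))   # a plain numeric vector remains
```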

The same logic should apply to using plyr, or base R split-apply-combine, as well (the key is using cut_number() to define the grouping variable).
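
As a rough sketch of the base R route, the grouping variable can be built by cutting at quantiles and the scaling done with `ave()`. The quantile-based `cut()` here is only a stand-in that I assume behaves like `cut_number()` for this purpose; the data, the sizes `n` and `k`, and the names `brks` and `grp_value` are all made up for illustration.

```r
set.seed(42)
n <- 1000L; k <- 4L                      # hypothetical sizes
d <- data.frame(mean = sort(rnorm(n)), fc = rnorm(n))

## Value-based groups: cut `mean` at its k-quantiles (stand-in for cut_number())
brks      <- quantile(d$mean, probs = seq(0, 1, length.out = k + 1L))
grp_value <- cut(d$mean, brks, include.lowest = TRUE)

## Scale fc within each group; ave() returns the results in the original row order
d$fc_scaled <- ave(d$fc, grp_value, FUN = function(x) c(scale(x)))
```

Because these groups are defined by the values of `mean` rather than by row position, tied values always land in the same group, so the groups need not coincide with fixed-size blocks of rows.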

With data.table, you can do:

library(data.table)
setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]

Here, KK is 130,209 and nn is 12. Reproducible data:

set.seed(100)
KK<-130209L; nn<-12L
dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
               mLow2=rnorm(KK),mHigh2=rnorm(KK),
               meanLow=rnorm(KK),meanHigh=rnorm(KK),
               fc=rnorm(KK),mean=rnorm(KK))

So no need to split the data and recombine.
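
The grouping vector passed to `by` can be inspected on its own in base R. This sketch only rebuilds that vector with the question's sizes and counts the rows per chunk; `grp` and `sizes` are names introduced here.

```r
KK <- 130209L; nn <- 12L

## Chunk label for every row: 1,1,...,2,2,..., truncated to KK entries
grp <- rep(1:nn, each = ceiling(KK / nn), length.out = KK)

## Rows per chunk: the first 11 chunks hold 10851 rows, the last holds 10848
sizes <- tabulate(grp, nbins = nn)
sizes
```

These sizes match the unbalanced split described in the question.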

If you'd like to add this to the data frame instead of just extracting it, you can use the := operator to assign by reference:

setDT(dt)[,fc_scaled:=scale(fc)...]

Tags: split, r, apply

Komal Rathi

3 Answers: IRTFM, Ben Bolker, MichaelChirico
