I have a data.frame of 130,209 rows.
> head(dt)
        mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh        fc    mean
A_00001 37.00  12.75 99.25  78.50  68.125   45.625 1.4931507 56.8750
A_00002 31.00  21.50 84.75  53.00  57.875   37.250 1.5536913 47.5625
A_00003 72.50  26.50 81.75  74.75  77.125   50.625 1.5234568 63.8750
I want to split the data.frame into 12 pieces, apply the scale function to the column fc within each piece, and then combine the pieces again. There is no grouping variable here, else I'd have used ddply. Also, because 130,209 is not perfectly divisible by 12, the resulting data.frames will be unbalanced, i.e., 11 data.frames will have 10,851 rows and the last one will have 10,848 rows, but that's fine.
So how do I split a data.frame by row into n chunks (in this case 12), apply a function to each chunk, and then combine the results? Any help would be much appreciated.
Update: The two top solutions give me different results. Using @Ben Bolker's solution:
mLow1 mHigh1 mLow2 mHigh2 UID gene_id meanLow meanHigh mean fc
1.5 3.25 1 1.25 MGLibB_00021 0610010K14Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00034 0610037L13Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibB_00058 1100001G20Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00061 1110001A16Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00104 1110034G24Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00110 1110038F14Rik 1.25 2.25 1.75 -0.5231249
Using @MichaelChirico's answer:
mLow1 mHigh1 mLow2 mHigh2 UID gene_id meanLow meanHigh mean fc fc_scaled
1.5 3.25 1 1.25 MGLibB_00021 0610010K14Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00034 0610037L13Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibB_00058 1100001G20Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00061 1110001A16Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00104 1110034G24Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00110 1110038F14Rik 1.25 2.25 1.75 0.5555556 -0.5089608
I'm not sure the structure of dt matters that much (if you are not using any of its internal values to do the splitting). Does this help?
spl.dt <- split( dt , cut(1:nrow(dt), 12) )
lapply( spl.dt, my_fun)
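To make the full split-apply-combine round trip concrete, here is a minimal base R sketch. It assumes a small made-up data frame in place of the real one, and the names n_chunks, grp, pieces and dt2 are illustrative, not taken from the question. unsplit() is used for the recombination step, which is one option among several.

```r
# Toy data standing in for the real 130,209-row data frame (assumption).
set.seed(1)
dt <- data.frame(fc = rnorm(25), mean = rnorm(25))
n_chunks <- 4  # hypothetical; the question uses 12

# Assign each row to one of n_chunks contiguous blocks by row position.
grp <- cut(seq_len(nrow(dt)), n_chunks, labels = FALSE)

# Split, scale fc within each piece, then restore the original row order.
pieces <- split(dt, grp)
pieces <- lapply(pieces, function(d) {
  d$fc <- as.numeric(scale(d$fc))  # as.numeric() drops the matrix attributes
  d
})
dt2 <- unsplit(pieces, grp)
```

If only the scaled column is needed, ave(dt$fc, grp, FUN = function(x) as.numeric(scale(x))) should give the same values without an explicit split.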
ggplot2 has a cut_number() convenience function that will do this for you. If you don't want the overhead of loading that package, you can look at ggplot2:::breaks for the necessary logic.
Reproducible example stolen from @MichaelChirico:
set.seed(100)
KK<-130209L; nn<-12L
library("dplyr")
dt <- data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
mLow2=rnorm(KK),mHigh2=rnorm(KK),
meanLow=rnorm(KK),meanHigh=rnorm(KK),
fc=rnorm(KK),mean=rnorm(KK)) %>% arrange(mean)
With apologies to those who don't like pipes:
library("ggplot2") ## for cut_number()
dt %>% mutate(grp=cut_number(mean,12)) %>%
group_by(grp) %>%
mutate(fc=c(scale(fc))) %>%
ungroup() %>%
select(-grp) %>% ## drop grouping variable
as.data.frame -> dt2 ## convert back to data frame, assign result
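It turns out that the c() around scale() is necessary: scale() returns a one-column matrix carrying "scaled:center" and "scaled:scale" attributes, and without c() the fc variable keeps those attributes, which confuses tail(). Wrapping the result in c() reduces it to a plain numeric vector.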
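The effect of c() on the result of scale() can be seen on a tiny vector. This is only an illustration; x, s and v are made-up names.

```r
x <- c(1, 2, 3, 4)

s <- scale(x)
is.matrix(s)          # TRUE: scale() returns a one-column matrix
names(attributes(s))  # includes "dim", "scaled:center", "scaled:scale"

v <- c(s)
is.null(attributes(v))  # TRUE: c() leaves a plain numeric vector
```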
The same logic should apply to using plyr, or base R split-apply-combine, as well (the key is using cut_number() to define the grouping variable).
With data.table, you can do:
library(data.table)
setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]
Here, KK is 130,209 and nn is 12. Reproducible data:
set.seed(100)
KK<-130209L; nn<-12L
dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
mLow2=rnorm(KK),mHigh2=rnorm(KK),
meanLow=rnorm(KK),meanHigh=rnorm(KK),
fc=rnorm(KK),mean=rnorm(KK))
So no need to split the data and recombine.
If you'd like to add this to the data frame instead of just extract it, you can use the := operator to assign by reference:
setDT(dt)[,fc_scaled:=scale(fc)...]
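The grouping vector passed to by in the data.table call above can be checked on its own in base R. This small sketch only confirms the chunk sizes mentioned in the question; grp and sizes are illustrative names.

```r
KK <- 130209L  # number of rows
nn <- 12L      # number of chunks

# Each group id is repeated ceiling(KK / nn) times, truncated to KK entries.
grp <- rep(1:nn, each = ceiling(KK / nn), length.out = KK)

# Rows per chunk: 10,851 for chunks 1 to 11 and 10,848 for chunk 12.
sizes <- as.integer(table(grp))
sizes
```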