There are two numeric columns in a data file. I need to calculate the average of the second column over fixed-width intervals (e.g. 100) of the first column.
I can do this in R, but my code is really slow for a relatively large data file (millions of rows, with the first column ranging from 1 to 33132539).
My R code is shown below. How could I tune it to be faster? Solutions based on Perl, Python, awk, or the shell are also welcome.
Thanks in advance.
(1) my data file (tab-delimited, millions of rows)
5380 30.07383
5390 30.87
5393 0.07383
5404 6
5428 30.07383
5437 1
5440 9
5443 30.07383
5459 6
5463 30.07383
5480 7
5521 30.07383
5538 0
5584 20
5673 30.07383
5720 30.07383
5841 3
5880 30.07383
5913 4
5958 30.07383
(2) what I want to get, with interval = 100
interval_of_first_column, average_of_2nd_column_in_interval
100, 0
200, 0
300, 20.34074
400, 14.90325
.....
(3) R code
chr1 <- 33132539 # set the limit for the interval
window <- 100 # set the size of interval
spe <- read.table("my_data_file", header=F) # read my data in
names(spe) <- c("pos", "rho") # name my data
interval.chr1 <- data.frame(pos=seq(0, chr1, window)) # setup intervals
meanrho.chr1 <- NULL # object for the mean I want to get
# real calculation, really slow on my own data.
for(i in 1:nrow(interval.chr1)){
  count.sub <- subset(spe, pos >= interval.chr1$pos[i] & pos < interval.chr1$pos[i] + window)
  meanrho.chr1[i] <- mean(count.sub$rho)
}
You don't really need to set up an output data.frame, but you can if you want. Here is how I would have coded it, and I guarantee it will be fast.
> dat$incrmt <- dat$V1 %/% 100
> dat
V1 V2 incrmt
1 5380 30.07383 53
2 5390 30.87000 53
3 5393 0.07383 53
4 5404 6.00000 54
5 5428 30.07383 54
6 5437 1.00000 54
7 5440 9.00000 54
8 5443 30.07383 54
9 5459 6.00000 54
10 5463 30.07383 54
11 5480 7.00000 54
12 5521 30.07383 55
13 5538 0.00000 55
14 5584 20.00000 55
15 5673 30.07383 56
16 5720 30.07383 57
17 5841 3.00000 58
18 5880 30.07383 58
19 5913 4.00000 59
20 5958 30.07383 59
> with(dat, tapply(V2, incrmt, mean, na.rm=TRUE))
53 54 55 56 57 58 59
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692
You could have done even less setup (skipping the incrmt variable) with this code:
> with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
53 54 55 56 57 58 59
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692
And if you want the result to be available for something:
by100MeanV2 <- with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
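For comparison, the same grouped mean can be sketched in Python with pandas, since the OP also asked for Python-based solutions. This is a minimal sketch using the sample rows above inline; for the real file you would point `pd.read_csv` at the data file instead (the column names `pos` and `rho` follow the question's R code):

```python
import pandas as pd
from io import StringIO

# Inline sample standing in for the real tab-delimited file.
sample = "5380\t30.07383\n5390\t30.87\n5393\t0.07383\n5404\t6\n5428\t30.07383\n"
df = pd.read_csv(StringIO(sample), sep="\t", header=None, names=["pos", "rho"])

# Integer-divide pos into 100-wide bins (same idea as V1 %/% 100 in R),
# then take the mean of rho within each bin.
means = df.groupby(df["pos"] // 100 * 100)["rho"].mean()
print(means)
```

Like the `tapply` version, this does a single vectorized pass instead of one `subset` per interval, which is what makes it fast on millions of rows.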
A Perl solution that reads the file once, accumulating a count and a sum per bin:
use strict;
use warnings;
my $BIN_SIZE = 100;
my %freq;
while (<>) {
    my ($k, $v) = split;
    my $bin = $BIN_SIZE * int($k / $BIN_SIZE);
    $freq{$bin}{n}++;
    $freq{$bin}{sum} += $v;
}
for my $bin (sort { $a <=> $b } keys %freq) {
    my ($n, $sum) = map $freq{$bin}{$_}, qw(n sum);
    print join("\t", $bin, $n, $sum, $sum / $n), "\n";
}
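The same single-pass streaming approach translates directly to Python, if you prefer it to Perl. This is a sketch over an inline sample; for the real file you would iterate over `sys.stdin` or an open file handle instead:

```python
from collections import defaultdict

BIN_SIZE = 100
sums = defaultdict(float)    # per-bin running sum of column 2
counts = defaultdict(int)    # per-bin row count

# Stand-in for reading lines from the actual data file.
lines = ["5380\t30.07383", "5390\t30.87", "5393\t0.07383", "5404\t6"]
for line in lines:
    k, v = line.split()
    b = int(k) // BIN_SIZE * BIN_SIZE   # floor to the bin's lower edge
    sums[b] += float(v)
    counts[b] += 1

for b in sorted(sums):
    print(b, counts[b], sums[b], sums[b] / counts[b], sep="\t")
```

Memory use is proportional to the number of non-empty bins, not the number of rows, so this handles millions of lines comfortably.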