There are two numeric columns in a data file. I need to calculate the average of the second column over fixed-width intervals (e.g. 100) of the first column.
I can do this in R, but my code is really slow for a relatively large data file (millions of rows, with the first column ranging from 1 to 33132539).
My R code is shown below. How could I tune it to be faster? Solutions based on Perl, Python, awk, or the shell are also welcome.
Thanks in advance.
(1) my data file (tab-delimited, millions of rows)
5380 30.07383
5390 30.87
5393 0.07383
5404 6
5428 30.07383
5437 1
5440 9
5443 30.07383
5459 6
5463 30.07383
5480 7
5521 30.07383
5538 0
5584 20
5673 30.07383
5720 30.07383
5841 3
5880 30.07383
5913 4
5958 30.07383
(2) what I want to get, with interval = 100
interval_of_first_column, average_of_2nd_column_in_interval
100, 0
200, 0
300, 20.34074
400, 14.90325
.....
(3) R code
chr1 <- 33132539 # set the limit for the interval
window <- 100 # set the size of interval
spe <- read.table("my_data_file", header=F) # read my data in
names(spe) <- c("pos", "rho") # name my data
interval.chr1 <- data.frame(pos=seq(0, chr1, window)) # setup intervals
meanrho.chr1 <- NULL # object for the mean I want to get
# real calculation, really slow on my own data.
for(i in 1:nrow(interval.chr1)){
  count.sub <- subset(spe, pos >= interval.chr1$pos[i] & pos < interval.chr1$pos[i] + window)
  meanrho.chr1[i] <- mean(count.sub$rho)
}
You don't really need to set up an output data.frame, but you can if you want. Here is how I would have coded it, and I guarantee it will be fast.
> dat$incrmt <- dat$V1 %/% 100
> dat
V1 V2 incrmt
1 5380 30.07383 53
2 5390 30.87000 53
3 5393 0.07383 53
4 5404 6.00000 54
5 5428 30.07383 54
6 5437 1.00000 54
7 5440 9.00000 54
8 5443 30.07383 54
9 5459 6.00000 54
10 5463 30.07383 54
11 5480 7.00000 54
12 5521 30.07383 55
13 5538 0.00000 55
14 5584 20.00000 55
15 5673 30.07383 56
16 5720 30.07383 57
17 5841 3.00000 58
18 5880 30.07383 58
19 5913 4.00000 59
20 5958 30.07383 59
> with(dat, tapply(V2, incrmt, mean, na.rm=TRUE))
53 54 55 56 57 58 59
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692
You could have done even less setup (skipping the incrmt variable) with this code:
> with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
53 54 55 56 57 58 59
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692
And if you want the result to be available for something:
by100MeanV2 <- with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
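For comparison, the same grouped mean can be sketched in Python with pandas, since the OP also asked for Python-based solutions. This is a minimal sketch using the sample rows above inline; for the real file you would point `pd.read_csv` at the data file instead (the column names `pos` and `rho` follow the question's R code):

```python
import pandas as pd
from io import StringIO

# Inline sample standing in for the real tab-delimited file.
sample = "5380\t30.07383\n5390\t30.87\n5393\t0.07383\n5404\t6\n5428\t30.07383\n"
df = pd.read_csv(StringIO(sample), sep="\t", header=None, names=["pos", "rho"])

# Integer-divide pos into 100-wide bins (same idea as V1 %/% 100 in R),
# then take the mean of rho within each bin.
means = df.groupby(df["pos"] // 100 * 100)["rho"].mean()
print(means)
```

Like the `tapply` version, this does a single vectorized pass instead of one `subset` per interval, which is what makes it fast on millions of rows.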
A Perl solution that reads the file once, accumulating a count and a sum per bin:
use strict;
use warnings;
my $BIN_SIZE = 100;
my %freq;
while (<>) {
    my ($k, $v) = split;
    my $bin = $BIN_SIZE * int($k / $BIN_SIZE);
    $freq{$bin}{n}++;
    $freq{$bin}{sum} += $v;
}
for my $bin (sort { $a <=> $b } keys %freq) {
    my ($n, $sum) = map $freq{$bin}{$_}, qw(n sum);
    print join("\t", $bin, $n, $sum, $sum / $n), "\n";
}
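The same single-pass streaming approach translates directly to Python, if you prefer it to Perl. This is a sketch over an inline sample; for the real file you would iterate over `sys.stdin` or an open file handle instead:

```python
from collections import defaultdict

BIN_SIZE = 100
sums = defaultdict(float)    # per-bin running sum of column 2
counts = defaultdict(int)    # per-bin row count

# Stand-in for reading lines from the actual data file.
lines = ["5380\t30.07383", "5390\t30.87", "5393\t0.07383", "5404\t6"]
for line in lines:
    k, v = line.split()
    b = int(k) // BIN_SIZE * BIN_SIZE   # floor to the bin's lower edge
    sums[b] += float(v)
    counts[b] += 1

for b in sorted(sums):
    print(b, counts[b], sums[b], sums[b] / counts[b], sep="\t")
```

Memory use is proportional to the number of non-empty bins, not the number of rows, so this handles millions of lines comfortably.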