I need to produce a sliding window of millions of lines and to calculate the median of column 3. My data looks like this with column 1 always being the same, column 2 equaling the line number and column 3 being the information that I need the median for:
HiC_scaffold_1 1 34
HiC_scaffold_1 2 34
HiC_scaffold_1 3 36
HiC_scaffold_1 4 37
HiC_scaffold_1 5 38
HiC_scaffold_1 6 39
HiC_scaffold_1 7 40
HiC_scaffold_1 8 40
HiC_scaffold_1 9 40
HiC_scaffold_1 10 41
HiC_scaffold_1 11 41
HiC_scaffold_1 12 41
HiC_scaffold_1 13 44
HiC_scaffold_1 14 44
HiC_scaffold_1 15 55
and I need a result like this, assuming a sliding window of 4 and rounding to the nearest whole number. In the real dataset I'll likely use a sliding window of 1000:
HiC_scaffold_1 4 35
HiC_scaffold_1 5 37
HiC_scaffold_1 6 38
HiC_scaffold_1 7 39
HiC_scaffold_1 8 40
HiC_scaffold_1 9 40
HiC_scaffold_1 10 40
HiC_scaffold_1 11 41
HiC_scaffold_1 12 41
HiC_scaffold_1 13 41
HiC_scaffold_1 14 43
HiC_scaffold_1 15 44
I found the following script here for doing what I want but for mean, not median:
awk -v OFS="\t" 'BEGIN {
window = 4
slide = 1
}
{
mod = NR % window
if (NR <= window) {
count++
} else {
sum -= array[mod]
}
sum += $3
array[mod] = $3
}
(NR % slide) == 0 {
print $1, NR, sum / count
}
' file.txt
and this script for calculating median with awk from here:
sort -n -k3 file.txt |
awk '{
arr[NR] = $3
}
END {
if (NR % 2 == 1) {
print arr[(NR + 1) / 2]
} else {
print $1 "\t" $2 "\t" (arr[NR / 2] + arr[NR / 2 + 1]) / 2
}
}
'
but I can't get them to work together. One other issue is that the median calculation requires a sorted input. I also found this datamash solution but I don't know how to make is work efficiently with a sliding window.
The following assumes the availability of the function asort
, as provided by GNU awk (gawk). The program is parameterized by wsize, the window size -- here 4:
gawk -v wsize=4 '
BEGIN {
if (wsize % 2 == 0) { m1=wsize/2; m2=m1+1; } else { m1 = m2 = (wsize+1)/2; }
}
function roundedmedian() {
asort(window, a);
return (m1==m2) ? a[m1] : int(0.5 + ((a[m1] + a[m2]) / 2));
}
function push(value) {
window[NR % wsize] = value;
}
NR < wsize { window[NR]=$3; next; }
{ push($3);
$3 = roundedmedian();
print $0;
}'
Using GNU awk for asort()
:
$ cat tst.awk
BEGIN {
OFS = "\t"
window = 4
befMid = int(window / 2)
aftMid = befMid + (window % 2 ? 0 : 1)
}
{ array[NR % window] = $3 }
NR >= window {
asort(array,vals)
print $1, $2, int( (vals[befMid] + vals[aftMid]) / 2 + 0.5 )
}
.
$ awk -f tst.awk file
HiC_scaffold_1 4 35
HiC_scaffold_1 5 37
HiC_scaffold_1 6 38
HiC_scaffold_1 7 39
HiC_scaffold_1 8 40
HiC_scaffold_1 9 40
HiC_scaffold_1 10 40
HiC_scaffold_1 11 41
HiC_scaffold_1 12 41
HiC_scaffold_1 13 41
HiC_scaffold_1 14 43
HiC_scaffold_1 15 44
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With