Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate median of a sliding window with awk

I need to produce a sliding window of millions of lines and to calculate the median of column 3. My data looks like this with column 1 always being the same, column 2 equaling the line number and column 3 being the information that I need the median for:

HiC_scaffold_1  1   34
HiC_scaffold_1  2   34
HiC_scaffold_1  3   36
HiC_scaffold_1  4   37
HiC_scaffold_1  5   38
HiC_scaffold_1  6   39
HiC_scaffold_1  7   40
HiC_scaffold_1  8   40
HiC_scaffold_1  9   40
HiC_scaffold_1  10  41
HiC_scaffold_1  11  41
HiC_scaffold_1  12  41
HiC_scaffold_1  13  44
HiC_scaffold_1  14  44
HiC_scaffold_1  15  55

and I need a result like this, assuming a sliding window of 4 and rounding to the nearest whole number. In the real dataset I'll likely use a sliding window of 1000:

HiC_scaffold_1  4   35
HiC_scaffold_1  5   37
HiC_scaffold_1  6   38
HiC_scaffold_1  7   39
HiC_scaffold_1  8   40
HiC_scaffold_1  9   40
HiC_scaffold_1  10  40
HiC_scaffold_1  11  41
HiC_scaffold_1  12  41
HiC_scaffold_1  13  41
HiC_scaffold_1  14  43
HiC_scaffold_1  15  44

I found the following script here for doing what I want but for mean, not median:

awk -v OFS="\t" 'BEGIN {
        window = 4
        slide = 1
}

{
        mod = NR % window
        if (NR <= window) {
                count++
        } else {
                sum -= array[mod]
        }
        sum += $3
        array[mod] = $3
}

(NR % slide) == 0 {
        print $1, NR, sum / count
}
' file.txt

and this script for calculating median with awk from here:

sort -n -k3 file.txt |
awk '{
        arr[NR] = $3
}

END {
        if (NR % 2 == 1) {
                print arr[(NR + 1) / 2]
        } else {
                print $1 "\t" $2 "\t" (arr[NR / 2] + arr[NR / 2 + 1]) / 2
        }
}
'

but I can't get them to work together. One other issue is that the median calculation requires a sorted input. I also found this datamash solution but I don't know how to make is work efficiently with a sliding window.

like image 255
acalcino Avatar asked Mar 24 '20 13:03

acalcino


2 Answers

The following assumes the availability of the function asort, as provided by GNU awk (gawk). The program is parameterized by wsize, the window size -- here 4:

gawk -v wsize=4 '
   BEGIN { 
    if (wsize % 2 == 0) { m1=wsize/2; m2=m1+1; } else { m1 = m2 = (wsize+1)/2; } 
   }
   function roundedmedian() {
     asort(window, a);
     return (m1==m2) ? a[m1] : int(0.5 + ((a[m1] + a[m2]) / 2));
   }
   function push(value) {
     window[NR % wsize] = value;
   }
   NR < wsize { window[NR]=$3; next; }
   { push($3);
     $3 = roundedmedian();
     print $0;
   }' 
like image 161
peak Avatar answered Oct 20 '22 09:10

peak


Using GNU awk for asort():

$ cat tst.awk
BEGIN {
    OFS = "\t"
    window = 4
    befMid = int(window / 2)
    aftMid = befMid + (window % 2 ? 0 : 1)
}
{ array[NR % window] = $3 }
NR >= window {
    asort(array,vals)
    print $1, $2, int( (vals[befMid] + vals[aftMid]) / 2 + 0.5 )
}

.

$ awk -f tst.awk file
HiC_scaffold_1  4       35
HiC_scaffold_1  5       37
HiC_scaffold_1  6       38
HiC_scaffold_1  7       39
HiC_scaffold_1  8       40
HiC_scaffold_1  9       40
HiC_scaffold_1  10      40
HiC_scaffold_1  11      41
HiC_scaffold_1  12      41
HiC_scaffold_1  13      41
HiC_scaffold_1  14      43
HiC_scaffold_1  15      44
like image 4
Ed Morton Avatar answered Oct 20 '22 07:10

Ed Morton