Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get regions from a file that are part of regions in other file (Without loops)

Tags:

sed

awk

perl

matlab

I have two files:

regions.txt: First column is the chromosome name, second and third are start and end position.

1  100  200
1  400  600
2  600  700

coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.

1 100 101  5
1 101 102  7 
1 103 105  8
2 600 601  10
2 601 602  15

This file is very huge it is about 15GB with about 300 million lines.

I basically want to get the mean of all scores in coverage.txt that are in each region in regions.txt.

In other words, start at the first line in regions.txt, if there is a line in coverage.txt which has the same chromosome, start-coverage is >= start-region, and end-coverage is <= end-region, then save its score to a new array. After finish searching in all coverages.txt print the region chromosome, start, end, and the mean of all scores that have been found.

Expected output:

1  100 200 14.6   which is (5+7+8)/3
1  400 600 0      no match at coverages.txt
2  600 700 12.5   which is (10+15)/2

I built the following MATLAB script which take very long time since I have to loop over coverage.txt many time. I don't know how to make a fast awk similar script.

My matlab script

fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');

while feof(ft) == 0

linet = fgetl(ft);
scant = textscan(linet, '%d%d%d');
tchr = scant{1};
tx = scant{2};
ty = scant{3};
coverages = [];

    frewind(fc);
    while feof(fc) == 0

    linec = fgetl(fc);
    scanc = textscan(linec, '%d%d%d%d');
    cchr = scanc{1};
    cx = scanc{2};
    cy = scanc{3};
    cov = scanc{4};


        if (cchr == tchr) && (cx >= tx) && (cy <= ty)

            coverages = cat(2, coverages, cov);

        end

    end

    covmed = median(coverages);
    fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);

end    

Any suggestions to make an alternative using AWK, Perl, or , ... etc I will aslo be pleased if someone can teach me how to get rid of all loops in my matlab script.

Thanks

like image 898
user1526694 Avatar asked Oct 13 '12 11:10

user1526694


1 Answers

Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.

This is potentially efficient, as I don't do a full loop over regions.txt on every input line. Efficiency could perhaps be increased further when multithreading is used.

#!/usr/bin/perl

my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile    or die "Can't open $rangefile";

# construct the ranges. The chromosome is used as range key.
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;

# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{$ranges{$chrom}}) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}

# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
    foreach my $range (@{$ranges{$chrom}}) {
        my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
    }
}

Example invocation:

$ perl script.pl regions.txt <coverage.txt >output.txt

Output on the example input:

1 100 200 6.7
1 400 600 0.0
2 600 700 12.5

(because (5+7+8)/3 = 6.66…)

like image 89
amon Avatar answered Sep 20 '22 04:09

amon