Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl: Removing duplicates from a large set of data

I'm using Perl to generate a list of unique exons (which are the units of genes).

I've generated a file in this format (with hundreds of thousands of lines):

chr1 1000 2000 gene1

chr1 3000 4000 gene2

chr1 5000 6000 gene3

chr1 1000 2000 gene4

Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.

Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth sets). I want to remove these "duplicate" - ie, delete gene1 or gene4 (not important which one gets removed).

I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.

Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:

for (my $i = 0; $i < scalar @exons; $i++) {
my @temp_line = line_splitter($exons[$i]);                      # runs subroutine turning scalar into array
for (my $j = 0; $j < scalar @exons_dup; $j++) {
    my @inner_temp_line = line_splitter($exons_dup[$j]);        # runs subroutine turning scalar into array
    unless (($temp_line[1] == $inner_temp_line[1]) &&           # this loop ensures that the the loop
            ($temp_line[3] eq $inner_temp_line[3])) {           # below skips the identical lines
                if (($temp_line[1] == $inner_temp_line[1]) &&   # if the coordinates are the same
                    ($temp_line[2] == $inner_temp_line[2])) {   # between the comparisons
                        splice(@exons, $i, 1);                  # delete the first one
                    }
            }
}

}

like image 628
David M Avatar asked Apr 18 '11 21:04

David M


People also ask

What is the method of removing duplicates without the remove duplicate stage?

There are multiple ways to remove duplicates other than using Remove Duplicates Stage. As stated above you can use Sort stage, Transformer stage. In sort stage, you can enable Key Change() column and it will be useful to filter the duplicate records. You can use Aggregator stage to remove duplicates.


1 Answers

my @exons = (
    'chr1 1000 2000 gene1',
    'chr1 3000 4000 gene2',
    'chr1 5000 6000 gene3',
    'chr1 1000 2000 gene4'
);

my %unique_exons = map { 
    my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
    "$chro $scoor $ecoor" => $gene
} @exons;

print "$_ $unique_exons{$_} \n" for keys %unique_exons;

This will give you uniqueness, and the last gene name will be included. This results in:

chr1 1000 2000 gene4 
chr1 5000 6000 gene3 
chr1 3000 4000 gene2
like image 147
dalton Avatar answered Nov 07 '22 06:11

dalton