I'm using Perl to generate a list of unique exons (which are the units of genes).
I've generated a file in this format (with hundreds of thousands of lines):
chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4
Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.
Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth sets). I want to remove these "duplicate" - ie, delete gene1 or gene4 (not important which one gets removed).
I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.
Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:
for (my $i = 0; $i < scalar @exons; $i++) {
my @temp_line = line_splitter($exons[$i]); # runs subroutine turning scalar into array
for (my $j = 0; $j < scalar @exons_dup; $j++) {
my @inner_temp_line = line_splitter($exons_dup[$j]); # runs subroutine turning scalar into array
unless (($temp_line[1] == $inner_temp_line[1]) && # this loop ensures that the the loop
($temp_line[3] eq $inner_temp_line[3])) { # below skips the identical lines
if (($temp_line[1] == $inner_temp_line[1]) && # if the coordinates are the same
($temp_line[2] == $inner_temp_line[2])) { # between the comparisons
splice(@exons, $i, 1); # delete the first one
}
}
}
}
There are multiple ways to remove duplicates other than using Remove Duplicates Stage. As stated above you can use Sort stage, Transformer stage. In sort stage, you can enable Key Change() column and it will be useful to filter the duplicate records. You can use Aggregator stage to remove duplicates.
my @exons = (
'chr1 1000 2000 gene1',
'chr1 3000 4000 gene2',
'chr1 5000 6000 gene3',
'chr1 1000 2000 gene4'
);
my %unique_exons = map {
my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
"$chro $scoor $ecoor" => $gene
} @exons;
print "$_ $unique_exons{$_} \n" for keys %unique_exons;
This will give you uniqueness, and the last gene name will be included. This results in:
chr1 1000 2000 gene4
chr1 5000 6000 gene3
chr1 3000 4000 gene2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With