I am back with another question. I have a list of data:
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
I would like to compare the 3rd and 5th elements of each row and group the rows that share the same 3rd and 5th elements. For example, with the data above, the result would be:
3: 3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
9: 9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
FYI, in the actual data the 3rd, 5th, and 7th elements are very long; I have truncated them here so that each row fits on one line.
This is what I have done so far. I know it is very clumsy, but as a beginner I am doing my best. The problem is that it shows only the first group of matching rows. Could you show me where it went wrong, and/or suggest a neater way to solve this?
my $file = <>;
open(IN, $file)|| die "no $file: $!\n";
my @arr;
while (my $line=<IN>){
    push @arr, [split (/\s+/, $line)] ;
}
close IN;
my (@temp1, @temp2,%hash1);
for (my $i=0;$i<=$#arr ;$i++) {
    push @temp1, [$arr[$i][2], $arr[$i][4]];
    for (my $j=$i+1;$j<=$#arr ;$j++) {
        push @temp2, [$arr[$j][2], $arr[$j][4]];
        if (($temp1[$i][0] eq $temp2[$j][0])&& ($temp1[$i][1] eq $temp2[$j][1])) {
            push @{$hash1{$arr[$i][0]}}, $arr[$i], $arr[$j];
        }
    }
}
print Dumper \%hash1;
You've made this more complicated than it needs to be, which is common for beginners. Think about how you would do it manually: read one line at a time and compare its third and fifth fields with those of the previous line.
The nested loops and temporary arrays are unnecessary:
#!/usr/bin/env perl
use strict;
use warnings;
my ($previous_row, $third, $fifth) = ('') x 3;
while (<DATA>) {
    my @fields = split;
    if ($fields[2] eq $third && $fields[4] eq $fifth) {
        print $previous_row if $previous_row;
        print "\t$_";
        $previous_row = '';
    } else {
        $previous_row = $fields[0] . "\t" . $_;
        $third = $fields[2];
        $fifth = $fields[4];
    }
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
(Note that I changed line 10 slightly so that its third field will match line 9 in order to get the same groups in the output as specified.)
Edit: One line of code was duplicated by a copy/paste error.
Edit 2: In response to comments, here's a second version which doesn't assume that the lines which should be grouped are contiguous:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    push @lines, [ $_, split ];
}
# Sort @lines based on third and fifth fields (alphabetically), then on
# first field/line number (numerically) when third and fifth fields match
@lines = sort {
    $a->[3] cmp $b->[3] || $a->[5] cmp $b->[5] || $a->[1] <=> $b->[1]
} @lines;
my ($previous_row, $third, $fifth) = ('') x 3;
for (@lines) {
    if ($_->[3] eq $third && $_->[5] eq $fifth) {
        print $previous_row if $previous_row;
        print "\t$_->[0]";
        $previous_row = '';
    } else {
        $previous_row = $_->[1] . "\t" . $_->[0];
        $third = $_->[3];
        $fifth = $_->[5];
    }
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
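Another option, offered only as a rough sketch and not as part of the two versions above, is to skip the sort and collect rows in a hash keyed on the third and fifth fields. The helper name group_rows, the "\0" separator used to join the two fields into one key, and the choice to skip rows that have no fifth field are all assumptions made for this illustration. The sample rows are a subset of the data above, with row 10 altered the same way so that it matches row 9.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: group rows by their third and fifth
# whitespace-separated fields, keeping first-seen order of the keys.
sub group_rows {
    my @rows = @_;
    my (%groups, @order);
    for my $row (@rows) {
        my @fields = split ' ', $row;
        next if @fields < 5;    # assumption: a row without a fifth field never matches
        my $key = join "\0", @fields[2, 4];
        push @order, $key unless exists $groups{$key};
        push @{ $groups{$key} }, $row;
    }
    # Return only the groups that contain two or more rows
    return map { $groups{$_} } grep { @{ $groups{$_} } > 1 } @order;
}

my @rows = (
    '1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE',
    '3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN',
    '9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP',
    '4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG',
    '10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP',
    '12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR',
);

# Print each group under the first field of its first row
for my $group (group_rows(@rows)) {
    my ($id) = split ' ', $group->[0];
    print "$id:\n";
    print "\t$_\n" for @$group;
}
```

With these sample rows, the sketch prints two groups, one headed 3 (rows 3 and 4) and one headed 9 (rows 9 and 10). The trade-off compared with the sorted version is that every row is held in memory under its key, but the input is read only once and the original row order inside each group is preserved.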