i have two files one with positions information and another is sequence information. Now i need to read the positions and take the snps at the positions and replace that position base with the snp information in the sequence and write it in the snp information file.. for example
Snp file contains
10 A C A/C
Sequence file contains
ATCGAACTCTACATTAC
Here 10th element is T so i will replace T with [A/C] so the final output should be
10 A C A/C ATCGAACTC[A/C]ACATTAC
Example files are
Snp file
SNP Ref Alt
10 A C
19 G C
30 C T
42 A G
Sequence :
sequence1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
Final output :
SNP Ref Alt Output
10 A C CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19 G C CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30 C T CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42 A G CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.
Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.
I have written code where it prints the positions in order and also replaces the sequence with that snp position but unable to read nearby positions and replace with N.
My approach might be completely wrong as i am beginner in coding.. I think by using hashes, we might achieve this easily but i am not so familiar with hashes..help with some suggestions please.. i dont have to stick with only perl,
my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];
%sequence_hash = ();
open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;
#### hashes and array
@id_array = ();
while ($line = <SNP>) {
next if $line =~ /^No/;
$line =~ s/\r|\n//g;
# if ($line =~ /\tINDEL/) {
# $indel_count++;
# $snp_type = "INDEL";
#} else {
# $snp_count++;
# $snp_type = "SNP";
#}
@parts = split(/\t/,$line);
$id = $parts[0];
$pos = $parts[1];
#$ref_base = $parts[3];
@temp_ref = split(",",$parts[2]);
$ref_base = $temp_ref[0];
@alt = split(",",$parts[3]);
$alt_base = $alt[0];
$snpformat = '';
if ($ref_base eq "A" || $alt_base eq "A")
{
if ($ref_base eq "A"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
elsif ($ref_base eq "T" || $alt_base eq "T")
{
if ($ref_base eq "T"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
elsif ($ref_base eq "C" || $alt_base eq "C")
{
if ($ref_base eq "C"){
$snpformat = "[".join("/",$ref_base,$alt_base)."]";}
#$snpformat = $ref_base/$alt_base;}
#print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }
else
{$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
#print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
}
else
{$snpformat = "-/-" ;}
print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n ";
}
open SEQ, $input_file ||die $!;
$header = '';
$sequence = '';
$num_sequences = 0;
while ($line = <SEQ>) {
$line =~ s/\r|\n//g;
next if $line =~ //;
if ($line =~ /^>(.+)/) {
if ($header eq '') {
$header = $1;
$sequence = '';
next;
} else {
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";
#print $sequence."\n";
$num_sequences++;
$header = $1;
$sequence = '';
}
} else {
$sequence .= $line;
}
}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";
$num_sequences++;
close (SEQ);
$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";
print "$pos \n";
This awk
script can probably help you get desired result.
awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR {
a[++i]=$0
next
}
FNR>1 {
x=substr(a[i],1,($1-1))
z=substr(a[i],($1+1))
if ($2=="A") {
y="["$2"/"$3"]"
}
else if ($2=="T" && $3=="A") {
y="["$3"/"$2"]"
}
else if ($2=="C" && ($2=="A" || $2=="T")) {
y="["$3"/"$2"]"
}
else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
y="["$3"/"$2"]"
}
else
y="["$3"/"$2"]"
print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
[jaypal:~/temp] cat sequence
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
[jaypal:~/temp] cat snp
SNP Ref Alt
10 A C
19 G C
30 C T
42 A G
[jaypal:~/temp] awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR {
a[++i]=$0
next
}
FNR>1 {
x=substr(a[i],1,($1-1))
z=substr(a[i],($1+1))
if ($2=="A") {
y="["$2"/"$3"]"
}
else if ($2=="T" && $3=="A") {
y="["$3"/"$2"]"
}
else if ($2=="C" && ($2=="A" || $2=="T")) {
y="["$3"/"$2"]"
}
else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
y="["$3"/"$2"]"
}
else
y="["$3"/"$2"]"
print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
SNP Ref Alt Output
10 A C CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19 G C CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30 C T CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42 A G CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With