Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Modify Sequence with snp position and output in same file

Tags:

sed

awk

perl

i have two files one with positions information and another is sequence information. Now i need to read the positions and take the snps at the positions and replace that position base with the snp information in the sequence and write it in the snp information file.. for example

Snp file contains

10 A C A/C

Sequence file contains

ATCGAACTCTACATTAC

Here 10th element is T so i will replace T with [A/C] so the final output should be

10 A C A/C ATCGAACTC[A/C]ACATTAC

Example files are

Snp file

SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

Sequence :

sequence1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

Final output :

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.

Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.

I have written code where it prints the positions in order and also replaces the sequence with that snp position but unable to read nearby positions and replace with N.

My approach might be completely wrong as i am beginner in coding.. I think by using hashes, we might achieve this easily but i am not so familiar with hashes..help with some suggestions please.. i dont have to stick with only perl,

my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];

%sequence_hash = ();

open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;

#### hashes and array
@id_array = ();

while ($line = <SNP>) {

    next if $line =~ /^No/;
    $line =~ s/\r|\n//g;


   # if ($line =~ /\tINDEL/) {

    #    $indel_count++;
     #   $snp_type = "INDEL";

    #} else {
     #   $snp_count++;
      #  $snp_type = "SNP";
    #}

    @parts = split(/\t/,$line);

    $id = $parts[0];
    $pos = $parts[1];
    #$ref_base = $parts[3];
    @temp_ref = split(",",$parts[2]);
    $ref_base = $temp_ref[0];
    @alt = split(",",$parts[3]);
    $alt_base = $alt[0];
    $snpformat = '';

    if ($ref_base eq "A" || $alt_base eq "A")
    {

        if ($ref_base eq "A"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "T" || $alt_base eq "T")
    {

        if ($ref_base eq "T"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "C" || $alt_base eq "C")
    {

        if ($ref_base eq "C"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }


    else 
    {$snpformat = "-/-" ;}
    print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n  ";
}

open SEQ, $input_file ||die $!;

$header = '';
$sequence = '';
$num_sequences = 0;

while ($line = <SEQ>) {

    $line =~ s/\r|\n//g;
    next if $line =~ //;

    if ($line =~ /^>(.+)/) {
        if ($header eq '') {

            $header = $1;
            $sequence = '';
            next;
        } else {

            $sequence_len = length($sequence);

            $sequence_hash{$header} = $sequence;
            push (@headers,$header);
            #print $header."\t".$sequence_len."\n";
            #print $sequence."\n";
            $num_sequences++;

            $header = $1;
            $sequence = '';

        }


    } else {

        $sequence .= $line;

    }

}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";

$num_sequences++;

close (SEQ);

$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";   
print "$pos \n";
like image 305
user630605 Avatar asked Nov 12 '22 19:11

user630605


1 Answers

This awk script can probably help you get desired result.

awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($2=="A" || $2=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        y="["$3"/"$2"]"
    print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp

Test:

[jaypal:~/temp] cat sequence
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

[jaypal:~/temp] cat snp
SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

[jaypal:~/temp] awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($2=="A" || $2=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        y="["$3"/"$2"]"
        print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
like image 191
jaypal singh Avatar answered Nov 15 '22 11:11

jaypal singh