Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract the first hit elements from an XML NCBI BLAST file?

Im trying to extract only the first hit from an NCBI xml BLAST file. next I would like to get only the first HSP. at the final stage I would like to get these based on best score. to make things clear here a sample of the xml file:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
  <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs&quot;,  Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
  <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
  <BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
  <BlastOutput_query-len>1024</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>1e-05</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
      <Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
      <Iteration_query-len>1024</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>68007</Statistics_db-num>
          <Statistics_db-len>19518578</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
    <Iteration>
<Iteration>
      <Iteration_iter-num>6</Iteration_iter-num>
      <Iteration_query-ID>lcl|6_0</Iteration_query-ID>
      <Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
      <Iteration_query-len>1068</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
          <Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
          <Hit_accession>23609</Hit_accession>
          <Hit_len>293</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>49.2914</Hsp_bit-score>
              <Hsp_score>116</Hsp_score>
              <Hsp_evalue>5.15408e-06</Hsp_evalue>
              <Hsp_query-from>580</Hsp_query-from>
              <Hsp_query-to>792</Hsp_query-to>
              <Hsp_hit-from>202</Hsp_hit-from>
              <Hsp_hit-to>273</Hsp_hit-to>
              <Hsp_query-frame>-1</Hsp_query-frame>
              <Hsp_identity>26</Hsp_identity>
              <Hsp_positive>45</Hsp_positive>
              <Hsp_gaps>2</Hsp_gaps>
              <Hsp_align-len>73</Hsp_align-len>
              <Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
              <Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
              <Hsp_midline>M I+ DV+   +T +++DD+ D   T   AAK L+E GA +V  Y TH + S + +EN+ ++ +D +  TN++</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
          <Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
          <Hit_accession>2466</Hit_accession>
          <Hit_len>3084</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>48.9062</Hsp_bit-score>
              <Hsp_score>115</Hsp_score>
              <Hsp_evalue>6.70494e-06</Hsp_evalue>
              <Hsp_query-from>369</Hsp_query-from>
              <Hsp_query-to>875</Hsp_query-to>
              <Hsp_hit-from>2312</Hsp_hit-from>
              <Hsp_hit-to>2465</Hsp_hit-to>
              <Hsp_query-frame>-2</Hsp_query-frame>
              <Hsp_identity>52</Hsp_identity>
              <Hsp_positive>70</Hsp_positive>
              <Hsp_gaps>4</Hsp_gaps>
              <Hsp_align-len>173</Hsp_align-len>
          <Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
          <Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
          <Hsp_midline>AP +Q+P A    ++ +   K  P  +  T + T  A P   + T     A    PP+      PP S P A  R  P AP      P       P P+  P    P+   A +T PR + P  SA +AT    AP    SAK++  D    P+AG    PP+      GG V</Hsp_midline>
        </Hsp>
      </Hit_hsps>
    </Hit>
  </Iteration_hits>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-num>68007</Statistics_db-num>
      <Statistics_db-len>19518578</Statistics_db-len>
      <Statistics_hsp-len>0</Statistics_hsp-len>
      <Statistics_eff-space>0</Statistics_eff-space>
      <Statistics_kappa>0.041</Statistics_kappa>
      <Statistics_lambda>0.267</Statistics_lambda>
      <Statistics_entropy>0.14</Statistics_entropy>
    </Statistics>
  </Iteration_stat>
</Iteration>

basically each query search creates an Iteration element. each iteration can have multiple hit which in turn can have multiple HSPs. I would like to get only the first hit and it's first HSP from each iteration. if the BLAST found no hits I would like to ignore the iteration. I worked up this simple code:

#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')
tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()

save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))
iteration = tree.findall('BlastOutput_iterations/Iteration')
for iteration in iteration:
   for hit in iteration.findall('Iteration_hits/Hit'):
  It_Num = iteration.findtext('Iteration_iter-num')
  It_ID = iteration.findtext('Iteration_query-def')
  Hit_Def = hit.findtext('Hit_def')
  Num =  hit.findtext('Hit_num')
  ID = hit.findtext('Hit_id')
  DEF =  hit.findtext('Hit_def')
  ACC = hit.findtext('Hit_accession')
  save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
  for hsp in hit.findall('Hit_hsps'):
        HSPN = hsp.findtext('Hsp/Hsp_num')
        identities = hsp.findtext('Hsp/Hsp_identity')
        #print 'id: ', identities.rjust(4),
        length = hsp.findtext('Hsp/Hsp_align-len')
        #print 'len:', length.rjust(4),
        Per_ID = int(identities) * 100.0 / int(length)
        #print hsp.findtext('Hsp/Hsp_qseq')[:50]
        #print hsp.findtext('Hsp/Hsp_midline')[:50]
        #print hsp.findtext('Hsp/Hsp_hseq')[:50]
        save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID))
  save_file.write('n\n' % ())

any help would be greatly appriciated!

like image 479
Schrodinger's Cat Avatar asked Dec 20 '09 09:12

Schrodinger's Cat


People also ask

How do you save BLAST results?

To save, click on the "[Save Search Strategies]" link near the top of the blast results page.

What is the difference between megablast and BLASTN?

The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size (11). Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms.

What are the default scores used by BLAST for a base match and a base mismatch?

The default value is 50. For BLASTN, the scores are reward for a nucleotide match and penalty for a nucleotide mismatch. Scoring matrix name.

What is BLAST hit?

BLAST is a sequence similarity search tool, and it calculates an E-value and a bit-score to assess the quality of each match. An E-value represents the number of hits of equal or greater score expected to arise by chance.


Video Answer


2 Answers

While building your own parser may be "fun" there is already a package out there that can parse BLAST xml files ... it can even do the intermediate calling of a local BLAST instance for you if you so desire.

The main site is here: http://biopython.org/wiki/Biopython

and the XML BLAST parser is here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82

Something like:

from Bio.Blast import NCBIXML
with open('xml/results/file') as handle:
    all_records = NCBIXML.parse(handle)
    first_record = all_records.next()

Should work. I generally love the BioPython parsers and writers but I do not like the class-structure organization. So I generally just use the parser and extract the info I need into my own structure. YMMV

Hope that helps.

like image 98
JudoWill Avatar answered Oct 02 '22 09:10

JudoWill


I'll second what JudoWill suggested - work smarter, not harder with the BioPython parser. This should get you a little further:

from Bio.Blast import NCBIXML
blast = NCBIXML.parse(open('results.xml','rU'))
for record in blast:
    if record.alignments:
        # to print the "best" matches e-score
        print record.alignments[0].hsps[0].expect
        # to print the "best" matches bit-score
        print record.alignments[0].hsps[0].score
        break

This will stop after the first Query (returning the first and best match). Since I suspect you may want results for other queries within the same file, just remove the break from the last line.

like image 39
brant.faircloth Avatar answered Oct 02 '22 11:10

brant.faircloth