Im trying to extract only the first hit from an NCBI xml BLAST file. next I would like to get only the first HSP. at the final stage I would like to get these based on best score. to make things clear here a sample of the xml file:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
<BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
<BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
<BlastOutput_query-len>1024</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>1e-05</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>lcl|1_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1024</Iteration_query-len>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
<Iteration_message>No hits found</Iteration_message>
</Iteration>
<Iteration>
<Iteration>
<Iteration_iter-num>6</Iteration_iter-num>
<Iteration_query-ID>lcl|6_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1068</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
<Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
<Hit_accession>23609</Hit_accession>
<Hit_len>293</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>49.2914</Hsp_bit-score>
<Hsp_score>116</Hsp_score>
<Hsp_evalue>5.15408e-06</Hsp_evalue>
<Hsp_query-from>580</Hsp_query-from>
<Hsp_query-to>792</Hsp_query-to>
<Hsp_hit-from>202</Hsp_hit-from>
<Hsp_hit-to>273</Hsp_hit-to>
<Hsp_query-frame>-1</Hsp_query-frame>
<Hsp_identity>26</Hsp_identity>
<Hsp_positive>45</Hsp_positive>
<Hsp_gaps>2</Hsp_gaps>
<Hsp_align-len>73</Hsp_align-len>
<Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
<Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
<Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
<Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
<Hit_accession>2466</Hit_accession>
<Hit_len>3084</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>48.9062</Hsp_bit-score>
<Hsp_score>115</Hsp_score>
<Hsp_evalue>6.70494e-06</Hsp_evalue>
<Hsp_query-from>369</Hsp_query-from>
<Hsp_query-to>875</Hsp_query-to>
<Hsp_hit-from>2312</Hsp_hit-from>
<Hsp_hit-to>2465</Hsp_hit-to>
<Hsp_query-frame>-2</Hsp_query-frame>
<Hsp_identity>52</Hsp_identity>
<Hsp_positive>70</Hsp_positive>
<Hsp_gaps>4</Hsp_gaps>
<Hsp_align-len>173</Hsp_align-len>
<Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
<Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
<Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
basically each query search creates an Iteration element. each iteration can have multiple hit which in turn can have multiple HSPs. I would like to get only the first hit and it's first HSP from each iteration. if the BLAST found no hits I would like to ignore the iteration. I worked up this simple code:
#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')
tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))
iteration = tree.findall('BlastOutput_iterations/Iteration')
for iteration in iteration:
for hit in iteration.findall('Iteration_hits/Hit'):
It_Num = iteration.findtext('Iteration_iter-num')
It_ID = iteration.findtext('Iteration_query-def')
Hit_Def = hit.findtext('Hit_def')
Num = hit.findtext('Hit_num')
ID = hit.findtext('Hit_id')
DEF = hit.findtext('Hit_def')
ACC = hit.findtext('Hit_accession')
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
for hsp in hit.findall('Hit_hsps'):
HSPN = hsp.findtext('Hsp/Hsp_num')
identities = hsp.findtext('Hsp/Hsp_identity')
#print 'id: ', identities.rjust(4),
length = hsp.findtext('Hsp/Hsp_align-len')
#print 'len:', length.rjust(4),
Per_ID = int(identities) * 100.0 / int(length)
#print hsp.findtext('Hsp/Hsp_qseq')[:50]
#print hsp.findtext('Hsp/Hsp_midline')[:50]
#print hsp.findtext('Hsp/Hsp_hseq')[:50]
save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID))
save_file.write('n\n' % ())
any help would be greatly appriciated!
To save, click on the "[Save Search Strategies]" link near the top of the blast results page.
The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size (11). Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms.
The default value is 50. For BLASTN, the scores are reward for a nucleotide match and penalty for a nucleotide mismatch. Scoring matrix name.
BLAST is a sequence similarity search tool, and it calculates an E-value and a bit-score to assess the quality of each match. An E-value represents the number of hits of equal or greater score expected to arise by chance.
While building your own parser may be "fun" there is already a package out there that can parse BLAST xml files ... it can even do the intermediate calling of a local BLAST instance for you if you so desire.
The main site is here: http://biopython.org/wiki/Biopython
and the XML BLAST parser is here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82
Something like:
from Bio.Blast import NCBIXML
with open('xml/results/file') as handle:
all_records = NCBIXML.parse(handle)
first_record = all_records.next()
Should work. I generally love the BioPython parsers and writers but I do not like the class-structure organization. So I generally just use the parser and extract the info I need into my own structure. YMMV
Hope that helps.
I'll second what JudoWill suggested - work smarter, not harder with the BioPython parser. This should get you a little further:
from Bio.Blast import NCBIXML
blast = NCBIXML.parse(open('results.xml','rU'))
for record in blast:
if record.alignments:
# to print the "best" matches e-score
print record.alignments[0].hsps[0].expect
# to print the "best" matches bit-score
print record.alignments[0].hsps[0].score
break
This will stop after the first Query (returning the first and best match). Since I suspect you may want results for other queries within the same file, just remove the break
from the last line.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With