Dalke Scientific Software: More science. Less time.

ElementTree

XML is a popular data format, and there are many ways to parse an XML file. The two traditional approaches, SAX and DOM, are both somewhat cumbersome to use. A more "pythonic" parser is ElementTree. It has become widely used over the last few years and will be part of the standard Python distribution starting with Python 2.5, where it is available as xml.etree.ElementTree.

I'll go through the slides from my EuroPython talk and from Andrew Kuchling's 'Processing XML with ElementTree'. The ElementTree page has yet more documentation.

Here's a BLAST XML document I got from Peace. The filename is "ecoli.xml".

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>blastp 2.2.14 [May-07-2006]</BlastOutput_version>
  <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs&quot;,  Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/root/TG-blast/tgblast/ecoli</BlastOutput_db>
  <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
  <BlastOutput_query-def>gi|16132220|ref|NP_418820.1| predicted rRNA methyltransferase [Escherichia coli K12]</BlastOutput_query-def>
  <BlastOutput_query-len>228</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
      <Iteration_query-def>gi|16132220|ref|NP_418820.1| predicted rRNA methyltransferase [Escherichia coli K12]</Iteration_query-def>
      <Iteration_query-len>228</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|16132220|ref|NP_418820.1|</Hit_id>
          <Hit_def>predicted rRNA methyltransferase [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_418820</Hit_accession>
          <Hit_len>228</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>416.387</Hsp_bit-score>
              <Hsp_score>1069</Hsp_score>
              <Hsp_evalue>6.63351e-118</Hsp_evalue>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>228</Hsp_query-to>
              <Hsp_hit-from>1</Hsp_hit-from>
              <Hsp_hit-to>228</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>214</Hsp_identity>
              <Hsp_positive>214</Hsp_positive>
              <Hsp_align-len>228</Hsp_align-len>
              <Hsp_qseq>MRITIILVXXXXXXXXXXXXXXMKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKVFPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLTNEELALADVLTGVPMVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRERAMTLLTTLAVADDIKLVDWLQQRLGLLEQRDTAMLHRLLHDIEKNITK</Hsp_qseq>
              <Hsp_hseq>MRITIILVAPARAENIGAAARAMKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKVFPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLTNEELALADVLTGVPMVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRERAMTLLTTLAVADDIKLVDWLQQRLGLLEQRDTAMLHRLLHDIEKNITK</Hsp_hseq>
              <Hsp_midline>MRITIILV              MKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKVFPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLTNEELALADVLTGVPMVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRERAMTLLTTLAVADDIKLVDWLQQRLGLLEQRDTAMLHRLLHDIEKNITK</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gi|16130457|ref|NP_417027.1|</Hit_id>
          <Hit_def>predicted methyltransferase [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_417027</Hit_accession>
          <Hit_len>246</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>70.4774</Hsp_bit-score>
              <Hsp_score>171</Hsp_score>
              <Hsp_evalue>8.92882e-14</Hsp_evalue>
              <Hsp_query-from>3</Hsp_query-from>
              <Hsp_query-to>155</Hsp_query-to>
              <Hsp_hit-from>5</Hsp_hit-from>
              <Hsp_hit-to>156</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>52</Hsp_identity>
              <Hsp_positive>70</Hsp_positive>
              <Hsp_gaps>1</Hsp_gaps>
              <Hsp_align-len>153</Hsp_align-len>
              <Hsp_qseq>ITIILVXXXXXXXXXXXXXXMKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKVFPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLTNEELALADVLTGVPMVADYPSLNLGQAVMVYCYQL</Hsp_qseq>
              <Hsp_hseq>IRIVLVETSHTGNMGSVARAMKTMGLTNLWLVNPLVKPDSQAIALAAGASDVIGNAHIVDTLDEALAGCSLVVGTSARSRTLPWPMLDPREC-GLKSVAEAANTPVALVFGRERVGLTNEELQKCHYHVAIAANPEYSSLNLAMAVQVIAYEV</Hsp_hseq>
              <Hsp_midline>I I+LV              MKTMG ++L +V+     +     +A G+ D+I N  +  TL E+L      V T+ARSR        P E   L     +  +  ALVFGRE  GLTNEEL        +    +Y SLNL  AV V  Y++</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>3</Hit_num>
          <Hit_id>gi|16131522|ref|NP_418108.1|</Hit_id>
          <Hit_def>tRNA (Guanosine-2&apos;-O-)-methyltransferase [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_418108</Hit_accession>
          <Hit_len>229</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>34.6538</Hsp_bit-score>
              <Hsp_score>78</Hsp_score>
              <Hsp_evalue>0.0054295</Hsp_evalue>
              <Hsp_query-from>110</Hsp_query-from>
              <Hsp_query-to>154</Hsp_query-to>
              <Hsp_hit-from>116</Hsp_hit-from>
              <Hsp_hit-to>160</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>17</Hsp_identity>
              <Hsp_positive>27</Hsp_positive>
              <Hsp_align-len>45</Hsp_align-len>
              <Hsp_qseq>LVFGREDSGLTNEELALADVLTGVPMVADYPSLNLGQAVMVYCYQ</Hsp_qseq>
              <Hsp_hseq>ILMGQEKTGITQEALALADQDIIIPMIGMVQSLNVSVASALILYE</Hsp_hseq>
              <Hsp_midline>++ G+E +G+T E LALAD    +PM+    SLN+  A  +  Y+</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>4</Hit_num>
          <Hit_id>gi|16131477|ref|NP_418063.1|</Hit_id>
          <Hit_def>predicted rRNA methylase [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_418063</Hit_accession>
          <Hit_len>157</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>28.8758</Hsp_bit-score>
              <Hsp_score>63</Hsp_score>
              <Hsp_evalue>0.297927</Hsp_evalue>
              <Hsp_query-from>110</Hsp_query-from>
              <Hsp_query-to>154</Hsp_query-to>
              <Hsp_hit-from>97</Hsp_hit-from>
              <Hsp_hit-to>143</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>19</Hsp_identity>
              <Hsp_positive>24</Hsp_positive>
              <Hsp_gaps>2</Hsp_gaps>
              <Hsp_align-len>47</Hsp_align-len>
              <Hsp_qseq>LVFGREDSGLTNEELAL--ADVLTGVPMVADYPSLNLGQAVMVYCYQ</Hsp_qseq>
              <Hsp_hseq>LMFGPETRGLPASILDALPAEQKIRIPMVPDSRSMNLSNAVSVVVYE</Hsp_hseq>
              <Hsp_midline>L+FG E  GL    L    A+    +PMV D  S+NL  AV V  Y+</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>5</Hit_num>
          <Hit_id>gi|16132002|ref|NP_418601.1|</Hit_id>
          <Hit_def>23S rRNA (Gm2251)-methyltransferase [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_418601</Hit_accession>
          <Hit_len>243</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>27.7202</Hsp_bit-score>
              <Hsp_score>60</Hsp_score>
              <Hsp_evalue>0.663712</Hsp_evalue>
              <Hsp_query-from>35</Hsp_query-from>
              <Hsp_query-to>154</Hsp_query-to>
              <Hsp_hit-from>129</Hsp_hit-from>
              <Hsp_hit-to>237</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>30</Hsp_identity>
              <Hsp_positive>47</Hsp_positive>
              <Hsp_gaps>17</Hsp_gaps>
              <Hsp_align-len>123</Hsp_align-len>
              <Hsp_qseq>DSQAHLEPATRWVAHGSGDIIDNIKVFPTLAES---LHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLTNEELALADVLTGVPMVADYPSLNLGQAVMVYCYQ</Hsp_qseq>
              <Hsp_hseq>DRSAQLNATAKKVACGAAESVPLIRV-TNLARTMRMLQEENIWIVGTA-GEADHTLY------------QSKMTGRLALVMGAEGEGMRRLTREHCDELISIPMAGSVSSLNVSVATGICLFE</Hsp_hseq>
              <Hsp_midline>D  A L    + VA G+ + +  I+V   LA +   L + +  +  TA   A +  Y            +S      ALV G E  G+        D L  +PM     SLN+  A  +  ++</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>6</Hit_num>
          <Hit_id>gi|16131061|ref|NP_417638.1|</Hit_id>
          <Hit_def>transcription elongation factor NusA [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_417638</Hit_accession>
          <Hit_len>495</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>26.1794</Hsp_bit-score>
              <Hsp_score>56</Hsp_score>
              <Hsp_evalue>1.93111</Hsp_evalue>
              <Hsp_query-from>170</Hsp_query-from>
              <Hsp_query-to>198</Hsp_query-to>
              <Hsp_hit-from>399</Hsp_hit-from>
              <Hsp_hit-to>427</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>13</Hsp_identity>
              <Hsp_positive>18</Hsp_positive>
              <Hsp_align-len>29</Hsp_align-len>
              <Hsp_qseq>DQHQLQALRERAMTLLTTLAVADDIKLVD</Hsp_qseq>
              <Hsp_hseq>DEPTVEALRERAKNALATIAQAQEESLGD</Hsp_hseq>
              <Hsp_midline>D+  ++ALRERA   L T+A A +  L D</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>7</Hit_num>
          <Hit_id>gi|16129062|ref|NP_415617.1|</Hit_id>
          <Hit_def>DNA polymerase III subunit delta&apos; [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_415617</Hit_accession>
          <Hit_len>334</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>24.2534</Hsp_bit-score>
              <Hsp_score>51</Hsp_score>
              <Hsp_evalue>7.33819</Hsp_evalue>
              <Hsp_query-from>78</Hsp_query-from>
              <Hsp_query-to>126</Hsp_query-to>
              <Hsp_hit-from>154</Hsp_hit-from>
              <Hsp_hit-to>204</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>18</Hsp_identity>
              <Hsp_positive>22</Hsp_positive>
              <Hsp_gaps>2</Hsp_gaps>
              <Hsp_align-len>51</Hsp_align-len>
              <Hsp_qseq>TARSRAKYHYYATPVE--LVPLLEEKSSWMSHAALVFGREDSGLTNEELAL</Hsp_qseq>
              <Hsp_hseq>TLRSRCRLHYLAPPPEQYAVTWLSREVTMSQDALLAALRLSAGSPGAALAL</Hsp_hseq>
              <Hsp_midline>T RSR + HY A P E   V  L  + +    A L   R  +G     LAL</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>8</Hit_num>
          <Hit_id>gi|16129252|ref|NP_415807.1|</Hit_id>
          <Hit_def>predicted antimicrobial peptide transporter subunit [Escherichia coli K12]</Hit_def>
          <Hit_accession>NP_415807</Hit_accession>
          <Hit_len>330</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>23.8682</Hsp_bit-score>
              <Hsp_score>50</Hsp_score>
              <Hsp_evalue>9.58398</Hsp_evalue>
              <Hsp_query-from>122</Hsp_query-from>
              <Hsp_query-to>181</Hsp_query-to>
              <Hsp_hit-from>165</Hsp_hit-from>
              <Hsp_hit-to>226</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>14</Hsp_identity>
              <Hsp_positive>29</Hsp_positive>
              <Hsp_gaps>2</Hsp_gaps>
              <Hsp_align-len>62</Hsp_align-len>
              <Hsp_qseq>EELALADVLTGVP--MVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRERA</Hsp_qseq>
              <Hsp_hseq>QKVMIAIALANQPRLLIADEPTNSMEPTTQAQIFRLLTRLNQNSNTTILLISHDLQMLSQWA</Hsp_hseq>
              <Hsp_midline>+++ +A  L   P  ++AD P+ ++        ++L T + Q + +      H LQ L + A</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>4243</Statistics_db-num>
          <Statistics_db-len>1342017</Statistics_db-len>
          <Statistics_hsp-len>81</Statistics_hsp-len>
          <Statistics_eff-space>1.46755e+08</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

I want to parse it and extract various bits of information. Here's an easy first one - the BLAST program used and the comparison matrix:
>>> from elementtree import ElementTree
>>> tree = ElementTree.parse("ecoli.xml")
>>> tree.find("BlastOutput_program").text
'blastp'
>>> tree.find("BlastOutput_param/Parameters/Parameters_matrix").text
'BLOSUM62'
>>> 
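The session above imports from the standalone elementtree package. From Python 2.5 on, the same module ships in the standard library as xml.etree.ElementTree. Here is a minimal sketch of an import that tries the standard-library location first. It assumes Python 3 print syntax, and the SAMPLE string is a cut-down stand-in for ecoli.xml made up for illustration, so the example runs without the file.

```python
# Prefer the standard-library module (Python 2.5 and later); fall back to
# the standalone "elementtree" package on older installations.
try:
    from xml.etree import ElementTree
except ImportError:
    from elementtree import ElementTree

# Assumption: a tiny excerpt with only the two fields used below.
SAMPLE = """<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
    </Parameters>
  </BlastOutput_param>
</BlastOutput>"""

# fromstring() returns the root element; the same paths work on it
# as on the tree object returned by parse().
root = ElementTree.fromstring(SAMPLE)
program = root.find("BlastOutput_program").text
matrix = root.find("BlastOutput_param/Parameters/Parameters_matrix").text
print(program, matrix)
```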
Using ".text" gets repetitive, so ElementTree has a "findtext" method, which is like a find followed by a ".text":
>>> tree.findtext("BlastOutput_program")
'blastp'
>>> tree.findtext("BlastOutput_param/Parameters/Parameters_matrix")
'BLOSUM62'
>>> 
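One practical difference between the two styles shows up when an element is missing. find returns None, so a following ".text" raises AttributeError, while findtext returns None or a default you supply. The first HSP in the listing above has no Hsp_gaps element, which makes it a handy test case. A small sketch, assuming Python 3 and an inline excerpt of that HSP; treating a missing Hsp_gaps as zero gaps is my assumption here, not something the file states.

```python
from xml.etree import ElementTree

# Excerpt of the first HSP from ecoli.xml; note there is no <Hsp_gaps>.
HSP_XML = """<Hsp>
  <Hsp_num>1</Hsp_num>
  <Hsp_identity>214</Hsp_identity>
  <Hsp_align-len>228</Hsp_align-len>
</Hsp>"""

hsp = ElementTree.fromstring(HSP_XML)

# find() gives None for a missing child, so ".text" on it would fail.
missing = hsp.find("Hsp_gaps")

# findtext() gives None for a missing child unless a default is passed.
gaps_text = hsp.findtext("Hsp_gaps")

# Assumption: an absent Hsp_gaps element means the alignment has no gaps.
gaps = int(hsp.findtext("Hsp_gaps", default="0"))
identity = int(hsp.findtext("Hsp_identity"))

print(missing, gaps_text, gaps, identity)
```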
Next I'll list all of the sequence identifiers for the hits in the first iteration. This file has only one iteration, so I could do findall("BlastOutput_iterations/Iteration/Iteration_hits/Hit"), but that would silently give the wrong answer if there were multiple iterations, because it would match every hit from every iteration.
>>> iteration = tree.find("BlastOutput_iterations/Iteration")
>>> for hit in iteration.findall("Iteration_hits/Hit"):
...   print hit.findtext("Hit_id")
... 
gi|16132220|ref|NP_418820.1|
gi|16130457|ref|NP_417027.1|
gi|16131522|ref|NP_418108.1|
gi|16131477|ref|NP_418063.1|
gi|16132002|ref|NP_418601.1|
gi|16131061|ref|NP_417638.1|
gi|16129062|ref|NP_415617.1|
gi|16129252|ref|NP_415807.1|
>>> 
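To make the multiple-iteration pitfall concrete, here is a sketch with a made-up two-iteration document. The hit names hit_A, hit_B and hit_C are invented for illustration, and the code assumes Python 3. The full-path findall mixes the hits from both iterations together, while looping over each Iteration keeps them apart.

```python
from xml.etree import ElementTree

# Hypothetical document with two iterations (not from ecoli.xml).
TWO_ITERATIONS = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_id>hit_A</Hit_id></Hit>
        <Hit><Hit_id>hit_B</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_id>hit_C</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

root = ElementTree.fromstring(TWO_ITERATIONS)

# The full path matches every hit from every iteration.
all_hits = root.findall("BlastOutput_iterations/Iteration/Iteration_hits/Hit")
all_ids = [hit.findtext("Hit_id") for hit in all_hits]

# Going through each Iteration element keeps the hits grouped.
hits_by_iteration = {}
for iteration in root.findall("BlastOutput_iterations/Iteration"):
    num = int(iteration.findtext("Iteration_iter-num"))
    hits_by_iteration[num] = [
        hit.findtext("Hit_id") for hit in iteration.findall("Iteration_hits/Hit")
    ]

print(all_ids)
print(hits_by_iteration)
```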
A hit may have multiple HSPs so I'll list some details about each one.
>>> for hit in iteration.findall("Iteration_hits/Hit"):
...   print hit.findtext("Hit_id")
...   for hsp in hit.findall("Hit_hsps/Hsp"):
...     print "  ", hsp.findtext("Hsp_num"), hsp.findtext("Hsp_evalue")
... 
gi|16132220|ref|NP_418820.1|
   1 6.63351e-118
gi|16130457|ref|NP_417027.1|
   1 8.92882e-14
gi|16131522|ref|NP_418108.1|
   1 0.0054295
gi|16131477|ref|NP_418063.1|
   1 0.297927
gi|16132002|ref|NP_418601.1|
   1 0.663712
gi|16131061|ref|NP_417638.1|
   1 1.93111
gi|16129062|ref|NP_415617.1|
   1 7.33819
gi|16129252|ref|NP_415807.1|
   1 9.58398
>>> 
As you can see, there's only a single HSP in each of these hits.
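The values returned by findtext are strings, so they need converting before any numeric comparison. The sketch below keeps only the HSPs whose e-value is at or below a cutoff. It assumes Python 3, uses three hits copied from the listing above, and the MAX_EVALUE threshold of 0.01 is an arbitrary value chosen for illustration.

```python
from xml.etree import ElementTree

# Three hits taken from ecoli.xml, reduced to the fields used here.
SAMPLE = """<Iteration>
  <Iteration_hits>
    <Hit>
      <Hit_id>gi|16132220|ref|NP_418820.1|</Hit_id>
      <Hit_hsps>
        <Hsp><Hsp_num>1</Hsp_num><Hsp_evalue>6.63351e-118</Hsp_evalue></Hsp>
      </Hit_hsps>
    </Hit>
    <Hit>
      <Hit_id>gi|16131522|ref|NP_418108.1|</Hit_id>
      <Hit_hsps>
        <Hsp><Hsp_num>1</Hsp_num><Hsp_evalue>0.0054295</Hsp_evalue></Hsp>
      </Hit_hsps>
    </Hit>
    <Hit>
      <Hit_id>gi|16131477|ref|NP_418063.1|</Hit_id>
      <Hit_hsps>
        <Hsp><Hsp_num>1</Hsp_num><Hsp_evalue>0.297927</Hsp_evalue></Hsp>
      </Hit_hsps>
    </Hit>
  </Iteration_hits>
</Iteration>"""

MAX_EVALUE = 0.01  # hypothetical cutoff, chosen only for this example

iteration = ElementTree.fromstring(SAMPLE)
significant = []
for hit in iteration.findall("Iteration_hits/Hit"):
    hit_id = hit.findtext("Hit_id")
    for hsp in hit.findall("Hit_hsps/Hsp"):
        # float() understands the exponent notation BLAST writes.
        evalue = float(hsp.findtext("Hsp_evalue"))
        if evalue <= MAX_EVALUE:
            significant.append((hit_id, int(hsp.findtext("Hsp_num")), evalue))

for hit_id, hsp_num, evalue in significant:
    print(hit_id, hsp_num, evalue)
```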

Using ElementTree in a Kid template

Not a problem. Here's a controller which loads an ElementTree

    @expose(template="slowdiv.templates.blast_table")
    def blast(self):
        from elementtree import ElementTree
        tree = ElementTree.parse("/Users/dalke/nbn/ecoli.xml")
        return dict(tree=tree)
and a template to display it in a table.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:py="http://purl.org/kid/ns#"
    py:extends="'master.kid'">

<head>
    <meta content="text/html; charset=UTF-8" http-equiv="content-type" py:replace="''"/>
    <title>Blast table</title>
</head>

<body>
<table border="1">
<tr><th colspan="2">${tree.findtext("BlastOutput_program")} results</th></tr>
<tr py:for="hit in tree.find('BlastOutput_iterations/Iteration').findall('Iteration_hits/Hit')">
<td>${hit.findtext("Hit_id")}</td>
<td>
 <span py:for="hsp in hit.findall('Hit_hsps/Hsp')">
  Hsp #${hsp.findtext('Hsp_num')}, e-value=${hsp.findtext('Hsp_evalue')}<br /></span>
</td></tr>
</table>
</body>
</html>
The result is
blastp results
gi|16132220|ref|NP_418820.1| Hsp #1, e-value=6.63351e-118
gi|16130457|ref|NP_417027.1| Hsp #1, e-value=8.92882e-14
gi|16131522|ref|NP_418108.1| Hsp #1, e-value=0.0054295
gi|16131477|ref|NP_418063.1| Hsp #1, e-value=0.297927
gi|16132002|ref|NP_418601.1| Hsp #1, e-value=0.663712
gi|16131061|ref|NP_417638.1| Hsp #1, e-value=1.93111
gi|16129062|ref|NP_415617.1| Hsp #1, e-value=7.33819
gi|16129252|ref|NP_415807.1| Hsp #1, e-value=9.58398
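An alternative to calling find and findall inside the template is to flatten the tree into plain tuples in the controller and let the template loop over those. This is only a sketch of that idea, not the approach used above. The helper name blast_rows is hypothetical, the code assumes Python 3, and the inline sample holds a single hit copied from ecoli.xml.

```python
from xml.etree import ElementTree


def blast_rows(root):
    """Return (hit_id, [(hsp_num, evalue), ...]) pairs for the first iteration."""
    iteration = root.find("BlastOutput_iterations/Iteration")
    rows = []
    for hit in iteration.findall("Iteration_hits/Hit"):
        hsps = [
            (hsp.findtext("Hsp_num"), hsp.findtext("Hsp_evalue"))
            for hsp in hit.findall("Hit_hsps/Hsp")
        ]
        rows.append((hit.findtext("Hit_id"), hsps))
    return rows


# One hit from ecoli.xml, reduced to the fields the table displays.
SAMPLE = """<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|16130457|ref|NP_417027.1|</Hit_id>
          <Hit_hsps>
            <Hsp><Hsp_num>1</Hsp_num><Hsp_evalue>8.92882e-14</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

rows = blast_rows(ElementTree.fromstring(SAMPLE))
for hit_id, hsps in rows:
    for hsp_num, evalue in hsps:
        print(hit_id, "Hsp #%s, e-value=%s" % (hsp_num, evalue))
```

The function works on either the tree object returned by parse or a root element, since both provide find. A controller could then return dict(rows=rows) instead of the whole tree.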



Copyright © 2001-2013 Andrew Dalke Scientific AB