I have an XML file with a format similar to docx, i.e.:
<w:r>
  <w:rPr>
    <w:sz w:val="36"/>
    <w:szCs w:val="36"/>
  </w:rPr>
  <w:t>BIG_TEXT</w:t>
</w:r>
I need to get the index (offset) of BIG_TEXT in the source XML, like this:
from lxml import etree
text = open('/devel/tmp/doc2/word/document.xml', 'r').read()
root = etree.XML(text)
start = 0
for e in root.iter("*"):
    if e.text:
        offset = text.index(e.text, start)
        l = len(e.text)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, l))
        start = offset + l
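I can start each new search at the previous index + len(text), as above, but is there another way? An element's text may be a single character, "w" for example. The search will then find the index of the first "w" in the source (such as the one in a tag name), not the index of the element's text "w".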
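A minimal sketch of that failure, using a made-up one-run sample rather than my real document (the string and the variable names are only for illustration):

```python
# Hypothetical sample: the element's text is the single character "w".
text = '<w:r><w:t>w</w:t></w:r>'

# Naive search: finds the first "w" anywhere in the source,
# which here is the "w" of the tag name "<w:r>".
naive = text.index('w')

# Where the text node really starts, located by hand for this sample
# by searching for the text together with its surrounding brackets.
actual = text.index('>w<') + 1

print(naive, actual)
```

The two numbers differ, which is why a plain substring search is not enough for short texts.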
I was looking for a similar solution (indexing nodes in a big XML file for fast lookup).
lxml only offers sourceline, which is insufficient. Cf. the API docs: "Original line number as found by the parser or None if unknown."
expat provides the exact byte offset in the file: CurrentByteIndex.
In the start_element handler, it returns the offset of the tag's start (i.e. '<').
In the char_data handler, it returns the offset of the data's start (i.e. 'B' in your example).
Example:
import xml.parsers.expat
# handler functions for parser events, and housekeeping.
class handler :
   def __init__(self, current_parser) :
      #tag of interest
      self.TARGET_TAG = "w:t"
      #set up parser
      self.parser = current_parser
      self.parser.StartElementHandler  = self.start_element
      self.parser.EndElementHandler    = self.end_element
      self.parser.CharacterDataHandler = self.char_data
      self.target_tag_met = False
      self.index = None
   def start_element(self, name, attrs):
      self.target_tag_met = (name == self.TARGET_TAG)
   def end_element(self, name) :
      self.target_tag_met = False
   def char_data(self, data):
      if self.target_tag_met :
         self.index = self.parser.CurrentByteIndex
# open the file in binary mode for more robust byte offsets.
xmlFile = open("so_test.xml", 'rb')
p = xml.parsers.expat.ParserCreate()
h = handler(p)
p.ParseFile(xmlFile)
print (h.index)
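The example above keeps only the last offset it sees. If you need the offset of every matching text node, something along these lines might work. It is a sketch, not a tested solution for real docx files: the function name text_offsets, the in-memory sample and the "urn:example" namespace are made up for illustration, and it assumes UTF-8 input.

```python
import xml.parsers.expat


def text_offsets(data, target_tag="w:t"):
    """Return (byte_offset, text) pairs for the text of each target_tag element.

    data must be bytes; offsets are byte offsets into data.
    """
    parser = xml.parsers.expat.ParserCreate()
    results = []
    state = {"inside": False, "start": None, "chunks": []}

    def start_element(name, attrs):
        if name == target_tag:
            state["inside"] = True
            state["start"] = None
            state["chunks"] = []

    def char_data(chunk):
        # expat may deliver one text node in several chunks,
        # so only the first chunk's offset is recorded.
        if state["inside"]:
            if state["start"] is None:
                state["start"] = parser.CurrentByteIndex
            state["chunks"].append(chunk)

    def end_element(name):
        if name == target_tag and state["inside"]:
            if state["start"] is not None:
                results.append((state["start"], "".join(state["chunks"])))
            state["inside"] = False

    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    parser.Parse(data, True)
    return results


# Hypothetical sample with two runs, one of them the single character "w".
sample = (b'<w:r xmlns:w="urn:example">'
          b'<w:t>w</w:t><w:t>BIG_TEXT</w:t>'
          b'</w:r>')

for offset, text in text_offsets(sample):
    # Slicing by the encoded length is only valid when the text
    # contains no entity references such as &amp;.
    print(offset, text, sample[offset:offset + len(text.encode("utf-8"))])
```

Slicing the source at each reported offset is a quick way to check that the offset points at the element's text and not at a tag name.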