Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python to parse non-standard XML file

My input file is actually multiple XML files appending to one file. (It's from Google Patents). It has below structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python xml.dom.minidom can't parse this non-standard file. What's a better way to parse this file? I am not below code has good performance or not.

for line in infile:
  if line == '<?xml version="1.0" encoding="UTF-8"?>': 
    xmldoc = minidom.parse(XMLstring)
  else:
    XMLstring += line
like image 625
Stan Avatar asked Sep 07 '11 14:09

Stan


2 Answers

Here's my take on it, using a generator and lxml.etree. Extracted information purely for example.

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
  buff = []
  for line in data:
    if separator(line):
      if buff:
        yield ''.join(buff)
        buff[:] = []
    buff.append(line)
  yield ''.join(buff)

def first(seq,default=None):
  """Return the first item from sequence, seq or the default(None) value"""
  for item in seq:
    return item
  return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
  with open(filename,'wb') as file_write:
    r = urllib2.urlopen(datasrc)
    file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
  count += 1
  if count > 10: break
  doc = etree.XML(item)
  docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
  title = first(doc.xpath('//invention-title/text()'))
  assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
  print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)

Yields:

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.
like image 162
MattH Avatar answered Sep 28 '22 01:09

MattH


I'd opt for parsing each chunk of XML separately.

You seem to already be doing that in your sample code. Here's my take on your code:

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk

Once you've broken the file down to individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options are lxml, minidom, elementtree, expat, BeautifulSoup, etc.


Update:

Starting from scratch, here's how I would do it (using BeautifulSoup):

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

for xml_string in separated_xml("ipgb20110104.xml"):
    soup = BeautifulSoup(xml_string)
    for num in soup.findAll("doc-number"):
        print num.contents[0]

This returns:

D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
like image 45
Shawn Chin Avatar answered Sep 28 '22 02:09

Shawn Chin