I am working on an XML SAX parser to parse XML files. My code is below.
XML file:
<job>
<title>Registered Nurse-Epilepsy</title>
<job-code>881723</job-code>
<detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
</detail-url>
<job-category>Neuroscience Nursing</job-category>
<description>
<summary>
<div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
</summary>
</description>
<posted-date>2012-07-26</posted-date>
<location>
<address>7777 Forest Lane</address>
<city>Dallas</city>
<state>TX</state>
<zip>75230</zip>
<country>US</country>
</location>
<company>
<name>Medical City (Dallas, TX)</name>
<url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
</company>
</job>
Python code (partial, only up to the startElement function, to illustrate my question):
from xml.sax.handler import ContentHandler
import xml.sax
import xml.parsers.expat
import ConfigParser

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print name, attrs
        self.clearFields()

    def endElement(self, name):
        pass

    def characters(self, data):
        self.buffer += data

    def clearFields(self):
        self.fields = {}
        self.fields['title'] = None
        self.fields['job-code'] = None
        self.fields['detail-url'] = None
        self.fields['job-category'] = None
        self.fields['description'] = None
        self.fields['summary'] = None
        self.fields['posted-date'] = None
        self.fields['location'] = None
        self.fields['address'] = None
        self.fields['city'] = None
        self.fields['state'] = None
        self.fields['zip'] = None
        self.fields['country'] = None
        self.fields['company'] = None
        self.fields['name'] = None
        self.fields['url'] = None
        self.buffer = ''

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = Exact()
    parser.setContentHandler(handler)
    parser.parse(open('/path/to/xml_file.xml'))
Result: the print statement above produces the following output:
job <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
As you can see above, I am getting name and attrs from the print statement, but what I actually want is the value of each element. How do I fetch the values for all of the tags above? I am only getting the node names, not their values.
Edit: I am really confused about how to map the data from the nodes to the keys in the dictionary shown above.
To get the content of an element, you need to override the characters method. Add this to your handler class:
def characters(self, data):
    print data
Be careful with this, though: the parser is not required to give you all the data in a single chunk. You should use an internal buffer and read it when needed. In most of my XML/SAX code I do something like this:
class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)
... and then call the flush method on the end of elements where I need the data.
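As a rough illustration of that pattern, here is a minimal sketch written for Python 3 (the snippets in this thread use Python 2 syntax). The class name BufferedHandler, the texts dictionary, the chunks_seen counter, and the sample document are all made up for the example. It joins the buffered chunks in endElement, so the result is the same however the parser happens to split the text:

```python
import xml.sax
import xml.sax.handler

# Hypothetical sample; the entity makes it likely that the parser
# reports the text in more than one characters() call.
SAMPLE = b"<name>Medical City &amp; Partners</name>"


class BufferedHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self.chunks_seen = 0   # how many times characters() was called
        self.texts = {}        # element name -> complete text

    def _flush_char_buffer(self):
        text = "".join(self._char_buffer)
        self._char_buffer = []
        return text

    def characters(self, data):
        self.chunks_seen += 1
        self._char_buffer.append(data)

    def endElement(self, name):
        # For a leaf element, all of its text chunks have arrived by now.
        self.texts[name] = self._flush_char_buffer()


handler = BufferedHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.chunks_seen, repr(handler.texts["name"]))
```

The number of chunks depends on the parser, which is exactly why the text should only be read after joining the buffer.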
For your whole use case - assuming you have a file containing multiple job descriptions and want a list which holds the jobs with each job being a dictionary of the fields, do something like this:
class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()  # remove strip() if whitespace is important
        self._charBuffer = []
        return data

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job':
            self._result.append({})

    def endElement(self, name):
        if name != 'job':
            self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml")  # a list of all jobs
If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method - just set _result to a dict and assign to it directly in endElement.
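A minimal sketch of that single-job variant, written for Python 3 and using a shortened, made-up sample document. The class name SingleJobHandler and the public fields attribute are illustrative choices, not something the approach requires:

```python
import xml.sax
import xml.sax.handler

# Shortened, hypothetical sample with a single <job> as the root element.
SAMPLE = b"""<job>
  <title>Registered Nurse-Epilepsy</title>
  <job-code>881723</job-code>
  <location>
    <city>Dallas</city>
  </location>
</job>"""


class SingleJobHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self.fields = {}  # one dict, because there is only one job

    def characters(self, data):
        self._char_buffer.append(data)

    def endElement(self, name):
        # Join and reset the buffer on every end tag, then store the text
        # under the element name (the <job> wrapper itself is skipped).
        text = "".join(self._char_buffer).strip()
        self._char_buffer = []
        if name != "job":
            self.fields[name] = text


handler = SingleJobHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.fields)
```

Container elements such as location end up as empty strings here, because the only text directly inside them is whitespace.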
To get the text content of a node, you need to implement a characters method. E.g.
class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print name, attrs

    def endElement(self, name):
        print 'end ' + name

    def characters(self, content):
        print content
Would output:
job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>
title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title
job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
end detail-url
(snipped)
You need to implement a characters handler too:
def characters(self, content):
    print content
but this potentially gives you text in chunks instead of as one block per tag.
Do yourself a big favour, though, and use the ElementTree API instead; that API is far more Pythonic and easier to use than the XML DOM API.
from xml.etree import ElementTree as ET

etree = ET.parse('/path/to/xml_file.xml')
# paths are relative to the root element, which is <job> in the sample file
jobtitle = etree.find('title').text
If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.
If you have a set of existing elements you want to look for, just use these in the find() method:
fieldnames = [
    'title', 'job-code', 'detail-url', 'job-category', 'description',
    'summary', 'posted-date', 'location', 'address', 'city', 'state',
    'zip', 'country', 'company', 'name', 'url']
fields = {}
etree = ET.parse('/path/to/xml_file.xml')
for field in fieldnames:
    elem = etree.find(field)
    # test the element that was found, not the field name string
    if elem is not None and elem.text is not None:
        fields[field] = elem.text
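Note that find() with a bare tag name only matches direct children of the root element, so nested fields such as city or state would be skipped by a loop like this. One way around that, sketched here for Python 3 under the assumptions of a shortened, made-up sample document and tag names that occur only once per job, is to search at any depth with a './/' prefix:

```python
from xml.etree import ElementTree as ET

# Shortened, hypothetical sample; <job> is the root element.
SAMPLE = """<job>
  <title>Registered Nurse-Epilepsy</title>
  <job-code>881723</job-code>
  <location>
    <city>Dallas</city>
    <state>TX</state>
  </location>
</job>"""

# 'url' is deliberately absent from the sample to show the None check.
fieldnames = ["title", "job-code", "city", "state", "url"]

root = ET.fromstring(SAMPLE)
fields = {}
for field in fieldnames:
    # './/' searches all descendants of the root, not just its children.
    elem = root.find(".//" + field)
    if elem is not None and elem.text is not None:
        fields[field] = elem.text.strip()

print(fields)
```

If a tag name can occur more than once (for example name or url in different containers), find() returns only the first match, so a more specific path such as 'company/name' would be the safer choice.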