I want to access the information present in the sub node. Is this because of the structure of the file?
Tried extracting the author subnode information in a file separately and run python code. That works fine
import urllib
import xml.etree.ElementTree as ET
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
print 'Retrieving', url
document = urllib.urlopen (url).read()
print 'Retrieved', len(document), 'characters.'
print document[:50]
tree = ET.fromstring(document)
lst = tree.findall('title')
print lst[:100]
To read an XML file using ElementTree, firstly, we import the ElementTree class found inside xml library, under the name ET (common convension). Then passed the filename of the xml file to the ElementTree. parse() method, to enable parsing of our xml file. Then got the root (parent tag) of our xml file using getroot().
The Pandas data analysis library provides functions to read/write data for most of the file types. For example, it includes read_csv() and to_csv() for interacting with CSV files. However, Pandas does not include any methods to read and write XML files.
XML files can be opened in a browser like IE or Chrome, with any text editor like Notepad or MS-Word. Even Excel can be used to open XML files.
You couldn't find title elements because of the namespace.
Below a sample code to find:
import xml.etree.ElementTree as ET
import urllib.request
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
response = urllib.request.urlopen(url).read()
tree = ET.fromstring(response)
for docTitle in tree.findall('{urn:hl7-org:v3}title'):
print(docTitle.text)
for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
print(compTitle.text)
UPDATE
If you need to search XML nodes you should use xPath Expressions
Example:
NS = '{urn:hl7-org:v3}'
ID = '829076996' # ID TO BE FOUND
# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
".//",
NS, "author/",
NS, "assignedEntity/",
NS, "representedOrganization/",
NS, "id[@extension='", ID,
"']/../../.."
])
# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
"./",
NS, "assignedEntity/",
NS, "representedOrganization/",
NS, "name"
])
# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
name = author.find(xPathAuthorName)
print(name.text)
This example prints the author name for the ID 829076996
UPDATE 2
You can easily process all assignedEntity tags with a findall method. For each of them you can have multiple products, so another findall method is needed (see example below).
xPathAssignedEntities = ''.join([
".//",
NS, "author/",
NS, "assignedEntity/",
NS, "representedOrganization/",
NS, "assignedEntity/",
NS, "assignedOrganization/",
NS, "assignedEntity"
])
xPathProdCode = ''.join([
NS, "actDefinition/",
NS, "product/",
NS, "manufacturedProduct/",
NS, "manufacturedMaterialKind/",
NS, "code"
])
# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):
# GET ID AND NAME OF assignedEntity
id = assignedEntity.find(NS + 'assignedOrganization/'+ NS + 'id').get('extension')
name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text
# FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
for performance in assignedEntity.findall(NS + 'performance'):
actCode = performance.find(NS + 'actDefinition/'+ NS + 'code').get('displayName')
prodCode = performance.find(xPathProdCode).get('code')
print(id, '\t', name, '\t', actCode, '\t', prodCode)
This is the result:
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-0050
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4900
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4910
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4940
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4960
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-0050
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4900
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4910
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4940
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4960
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4900
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4910
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4960
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4900
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4910
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4960
618054084 Pharmacia and Upjohn Company LLC ANALYSIS 0049-0050
618054084 Pharmacia and Upjohn Company LLC ANALYSIS 0049-4940
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4900
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4910
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4960
You could use xmltodict in order to generate a python dictionary from the requested XML data..
Here's a basic example:
import urllib2
import xmltodict
def foobar(request):
file = urllib2.urlopen('https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml')
data = file.read()
file.close()
data = xmltodict.parse(data)
return {'xmldata': data}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With