How to read an XML file from a URL in Python?

Tags: python, xml-parsing
I want to access the information in the sub-nodes of the XML file below, but my code does not find them. Is this because of the structure of the file?

I tried extracting the author sub-node into a separate file and running the Python code against that, and it works fine.

import urllib.request
import xml.etree.ElementTree as ET

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

print('Retrieving', url)

# urlopen().read() returns the raw response body as bytes
document = urllib.request.urlopen(url).read()
print('Retrieved', len(document), 'bytes.')

print(document[:50])

tree = ET.fromstring(document)

# Look up <title> elements directly under the root
lst = tree.findall('title')
print(lst[:100])


asked Feb 19 '19 10:02

PANKAJ KUMAR

2 Answers

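You couldn't find the title elements because of the namespace: the elements in this file belong to the urn:hl7-org:v3 namespace, so each tag name has to be qualified with it.

The sample code below finds:

The title under the root "document" tag
The titles at any depth, such as those under the inner "component" tags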

    import xml.etree.ElementTree as ET
    import urllib.request

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    response = urllib.request.urlopen(url).read()
    tree = ET.fromstring(response)


    for docTitle in tree.findall('{urn:hl7-org:v3}title'):
        print(docTitle.text)

    for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
        print(compTitle.text)
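
As a possible alternative to repeating the `{urn:hl7-org:v3}` prefix in every path, ElementTree's `find` and `findall` also accept a prefix-to-URI mapping. The sketch below runs against a small inline snippet that is made up for illustration (it is not the real DailyMed file), and the prefix name `hl7` is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the real document; its structure is an assumption.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <title>Document title</title>
  <component>
    <section>
      <title>Section title</title>
    </section>
  </component>
</document>"""

# The prefix name "hl7" is arbitrary; only the URI has to match the file.
NAMESPACES = {'hl7': 'urn:hl7-org:v3'}

root = ET.fromstring(SAMPLE)

# An unqualified tag name matches nothing in a namespaced document.
plain_titles = root.findall('title')

# Direct children of the root only.
doc_titles = [t.text for t in root.findall('hl7:title', NAMESPACES)]

# Titles at any depth below the root, in document order.
all_titles = [t.text for t in root.findall('.//hl7:title', NAMESPACES)]

print(plain_titles)
print(doc_titles)
print(all_titles)
```

Both spellings select the same elements; the mapping mostly helps readability when paths get long.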

UPDATE

If you need to search for specific XML nodes, you can use XPath expressions (ElementTree supports a limited subset of XPath).

Example:

NS = '{urn:hl7-org:v3}'
ID = '829076996'    # ID TO BE FOUND

# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "id[@extension='", ID,
    "']/../../.."
    ])

# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
    "./",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "name"
    ])

# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
    name = author.find(xPathAuthorName)
    print(name.text)

This example prints the author name for the ID 829076996.
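
The loop above reads `name.text` directly, which raises `AttributeError` whenever `find` returns `None` for an author without a name. A minimal sketch of a more tolerant variant uses `findtext` with a default value. The inline XML and the `'(no name)'` placeholder are made up for illustration and only mimic the nesting used in the paths above.

```python
import xml.etree.ElementTree as ET

NS = '{urn:hl7-org:v3}'

# Made-up sample: the second author deliberately has no <name> element.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="111"/>
        <name>Example Org</name>
      </representedOrganization>
    </assignedEntity>
  </author>
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="222"/>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

root = ET.fromstring(SAMPLE)

name_path = NS + 'assignedEntity/' + NS + 'representedOrganization/' + NS + 'name'

names = []
for author in root.findall(NS + 'author'):
    # findtext returns the default instead of failing when nothing matches.
    names.append(author.findtext(name_path, default='(no name)'))

print(names)
```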

UPDATE 2

You can easily process all assignedEntity tags with a findall method. For each of them you can have multiple products, so another findall method is needed (see example below).

xPathAssignedEntities = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "assignedEntity/", 
    NS, "assignedOrganization/", 
    NS, "assignedEntity"
    ])

xPathProdCode = ''.join([
    NS, "actDefinition/",
    NS, "product/",
    NS, "manufacturedProduct/",
    NS, "manufacturedMaterialKind/",
    NS, "code"
    ])


# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):

    # GET ID AND NAME OF assignedEntity (orgId avoids shadowing the built-in id)
    orgId = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
    name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text

    # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
    for performance in assignedEntity.findall(NS + 'performance'):
        actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
        prodCode = performance.find(xPathProdCode).get('code')
        print(orgId, '\t', name, '\t', actCode, '\t', prodCode)

This is the result:

829084545    Pfizer Pharmaceuticals LLC          ANALYSIS           0049-0050
829084545    Pfizer Pharmaceuticals LLC          ANALYSIS           0049-4900
829084545    Pfizer Pharmaceuticals LLC          ANALYSIS           0049-4910
829084545    Pfizer Pharmaceuticals LLC          ANALYSIS           0049-4940
829084545    Pfizer Pharmaceuticals LLC          ANALYSIS           0049-4960
829084545    Pfizer Pharmaceuticals LLC          API MANUFACTURE    0049-0050
829084545    Pfizer Pharmaceuticals LLC          API MANUFACTURE    0049-4900
829084545    Pfizer Pharmaceuticals LLC          API MANUFACTURE    0049-4910
829084545    Pfizer Pharmaceuticals LLC          API MANUFACTURE    0049-4940
829084545    Pfizer Pharmaceuticals LLC          API MANUFACTURE    0049-4960
829084545    Pfizer Pharmaceuticals LLC          MANUFACTURE        0049-4900
829084545    Pfizer Pharmaceuticals LLC          MANUFACTURE        0049-4910
829084545    Pfizer Pharmaceuticals LLC          MANUFACTURE        0049-4960
829084545    Pfizer Pharmaceuticals LLC          PACK               0049-4900
829084545    Pfizer Pharmaceuticals LLC          PACK               0049-4910
829084545    Pfizer Pharmaceuticals LLC          PACK               0049-4960
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS           0049-0050
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS           0049-4940
829084552    Pfizer Pharmaceuticals LLC          PACK               0049-4900
829084552    Pfizer Pharmaceuticals LLC          PACK               0049-4910
829084552    Pfizer Pharmaceuticals LLC          PACK               0049-4960
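
If the rows are needed for further processing rather than just printed, one option is to collect them and write them out as tab-separated text with the standard `csv` module. The following is only a sketch: the inline XML is a made-up, heavily trimmed stand-in for a single `assignedEntity`, and the IDs, names, and codes in it are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = '{urn:hl7-org:v3}'

# Made-up sample that only mimics the nesting used in the paths above.
SAMPLE = """<assignedEntity xmlns="urn:hl7-org:v3">
  <assignedOrganization>
    <id extension="000000001"/>
    <name>Example Labs</name>
  </assignedOrganization>
  <performance>
    <actDefinition>
      <code displayName="PACK"/>
      <product><manufacturedProduct><manufacturedMaterialKind>
        <code code="1111-2222"/>
      </manufacturedMaterialKind></manufacturedProduct></product>
    </actDefinition>
  </performance>
  <performance>
    <actDefinition>
      <code displayName="ANALYSIS"/>
      <product><manufacturedProduct><manufacturedMaterialKind>
        <code code="1111-3333"/>
      </manufacturedMaterialKind></manufacturedProduct></product>
    </actDefinition>
  </performance>
</assignedEntity>"""

# Namespace-qualified path from <performance> down to the product code.
PROD_CODE_PATH = '/'.join(NS + tag for tag in (
    'actDefinition', 'product', 'manufacturedProduct',
    'manufacturedMaterialKind', 'code'))

entity = ET.fromstring(SAMPLE)
org = entity.find(NS + 'assignedOrganization')
org_id = org.find(NS + 'id').get('extension')
org_name = org.find(NS + 'name').text

# One row per <performance> element of this entity.
rows = []
for performance in entity.findall(NS + 'performance'):
    act = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
    prod = performance.find(PROD_CODE_PATH).get('code')
    rows.append((org_id, org_name, act, prod))

# Write the rows as tab-separated text into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer, delimiter='\t', lineterminator='\n').writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text, end='')
```

Swapping `io.StringIO()` for a file opened with `newline=''` would write the same rows to disk.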


answered Sep 21 '22 00:09

manuel_b

You could use xmltodict to generate a Python dictionary from the requested XML data.

Here's a basic example:

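import urllib.request
import xmltodict

def foobar(request):
    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

    # Download the raw XML; the context manager closes the connection
    with urllib.request.urlopen(url) as response:
        data = response.read()

    # Convert the XML into nested dictionaries
    data = xmltodict.parse(data)
    return {'xmldata': data}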

answered Sep 22 '22 00:09

iLuvLogix
