
Use BeautifulSoup to Iterate over XML to pull specific tags and store in variable

I'm fairly new to programming and have been trying to find a solution for this but all I can find are bits and pieces with no real luck putting it all together.

I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually, so I'm trying to make that more efficient with a scraping program.

Let's say for example that I was looking at this type of test data to experiment with:

<AllergyList>
<Allergy>
    <Deleted>n</Deleted>
    <Status>
        <Active/>
    </Status>
    <ExternalID/>
    <Patient>
        <ExternalID/>
        <FirstName>Testcase</FirstName>
        <LastName>casetest</LastName>
    </Patient>
    <Allergen>
        <Name>Flagyl (metronidazole)</Name>
        <Drug>
           <NDCID>00025182151,00025182131,00025182150</NDCID>
        </Drug>
    </Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
 </Allergy>
<Allergy>
    <Deleted>n</Deleted>
    <Status>
        <Active/>
    </Status>
    <ExternalID/>
    <Patient>
        <ExternalID/>
        <FirstName>Testcase</FirstName>
        <LastName>casetest</LastName>
    </Patient>
    <Allergen>
        <Name>Bactrim (sulfamethoxazole-trimethoprim)</Name>
        <Drug>
            <NDCID>13310014501,49999023220</NDCID>
        </Drug>
    </Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
  <Number>2</Number>
</AllergyList>

I've been trying to pull the <Name> tag from inside each <Allergen> tag, along with the corresponding data from the <OnsetDate> and <Reaction> tags, and store the results in their own variables.

So for example I would want to pull Flagyl (metronidazole), difficulty breathing, 02/02/2013, then Bactrim (sulfamethoxazole-trimethoprim), swelling, 05/03/2002, and so on while placing them in separate variables that I can use later.

Pulling the first set from the <Allergen> tag is easy, but I'm having trouble figuring out how to iterate over the XML and store the pulled data in variables. I've been trying to use a for loop that stores the data in an array or list, but the way I've written it, I pull the same data over and over, once for each iteration I calculate with len(), and I have yet to store any of it in an array.

I've been racking my brain about this for a while now and I think I may just not be that smart so any help or even pointing me in the right direction would be immensely appreciated.

user2969206 asked Nov 08 '13 14:11


1 Answer

This is a fairly simple task because the tags are not deeply nested:

from bs4 import BeautifulSoup
import sys

# The 'xml' parser requires the lxml package to be installed.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'xml')

allergies = []
for allergy in soup.find_all('Allergy'):
    # Look up each tag relative to the current <Allergy> element.
    d = {
        'name': allergy.Allergen.Name.string,
        'reaction': allergy.Reaction.string,
        'on_set_date': allergy.OnsetDate.string,
    }
    allergies.append(d)

# Use the 'allergies' list of dictionaries as needed.
# Example: the reaction recorded for the second allergy.
print(allergies[1]['reaction'])

Run it with the XML file as an argument:

python3 script.py xmlfile

And this test yields:

swelling
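
If lxml is not available (BeautifulSoup's 'xml' parser depends on it), the same loop can be sketched with the standard library's xml.etree.ElementTree instead. This is a different parser from the one used above, not a BeautifulSoup feature. The SAMPLE_XML string is a trimmed copy of the question's test data, and parse_allergies is a hypothetical helper name chosen for illustration. The point to notice is that every lookup is made relative to the current <Allergy> element; searching from the document root inside the loop is one common way to end up with the same values on every iteration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the question's test data.
SAMPLE_XML = """
<AllergyList>
  <Allergy>
    <Allergen><Name>Flagyl (metronidazole)</Name></Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
  </Allergy>
  <Allergy>
    <Allergen><Name>Bactrim (sulfamethoxazole-trimethoprim)</Name></Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
  <Number>2</Number>
</AllergyList>
"""


def parse_allergies(xml_text):
    """Return one dict per <Allergy> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    allergies = []
    for allergy in root.iter('Allergy'):
        # Search relative to the current <Allergy>, not the whole document,
        # so each iteration reads its own values.
        allergies.append({
            'name': allergy.findtext('Allergen/Name'),
            'reaction': allergy.findtext('Reaction'),
            'on_set_date': allergy.findtext('OnsetDate'),
        })
    return allergies


allergies = parse_allergies(SAMPLE_XML)
for entry in allergies:
    print(entry['name'], '|', entry['reaction'], '|', entry['on_set_date'])
```

Each dictionary keeps one allergy's name, reaction, and onset date together, so later code can index by position and key rather than juggling separate variables per record.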
Birei answered Oct 13 '22 10:10