I'm fairly new to programming and have been trying to find a solution for this but all I can find are bits and pieces with no real luck putting it all together.
I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text values between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually, so I'm trying to increase efficiency a bit with a scraping program.
Let's say for example that I was looking at this type of test data to experiment with:
<AllergyList>
  <Allergy>
    <Deleted>n</Deleted>
    <Status>
      <Active/>
    </Status>
    <ExternalID/>
    <Patient>
      <ExternalID/>
      <FirstName>Testcase</FirstName>
      <LastName>casetest</LastName>
    </Patient>
    <Allergen>
      <Name>Flagyl (metronidazole)</Name>
      <Drug>
        <NDCID>00025182151,00025182131,00025182150</NDCID>
      </Drug>
    </Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
  </Allergy>
  <Allergy>
    <Deleted>n</Deleted>
    <Status>
      <Active/>
    </Status>
    <ExternalID/>
    <Patient>
      <ExternalID/>
      <FirstName>Testcase</FirstName>
      <LastName>casetest</LastName>
    </Patient>
    <Allergen>
      <Name>Bactrim (sulfamethoxazole-trimethoprim)</Name>
      <Drug>
        <NDCID>13310014501,49999023220</NDCID>
      </Drug>
    </Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
  <Number>2</Number>
</AllergyList>
I've been trying to pull the <Name> tag from inside each of the <Allergen> tags, as well as the corresponding data from the <OnsetDate> and <Reaction> tags, while storing the results in their own variables.
So for example I would want to pull Flagyl (metronidazole), difficulty breathing, 02/02/2013, then Bactrim (sulfamethoxazole-trimethoprim), swelling, 05/03/2002, and so on, while placing them in separate variables that I can use later.
Pulling the first set from the <Allergen> tag is easy, but I'm having trouble figuring out how to iterate over the XML and store the pulled data in variables. I've been trying to use a for loop and store the data in an array or list, but the way I've been writing it I always pull the same data over and over again, once for each iteration I calculate from the len() function, and so far I have failed to store any of it in an array.
I've been racking my brain about this for a while now and I think I may just not be that smart so any help or even pointing me in the right direction would be immensely appreciated.
This looks like a simple task because there aren't many nested tags:
from bs4 import BeautifulSoup
import sys

# The 'xml' parser requires the lxml package to be installed.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'xml')

allergies = []
for allergy in soup.find_all('Allergy'):
    # Build one dictionary per <Allergy> element.
    d = {
        'name': allergy.Allergen.Name.string,
        'reaction': allergy.Reaction.string,
        'on_set_date': allergy.OnsetDate.string,
    }
    allergies.append(d)

## Use the 'allergies' list of dictionaries as you want.
## Example:
print(allergies[1]['reaction'])
Run it with the XML file as argument:
python3 script.py xmlfile
And this test yields:
swelling
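If lxml is not available (BeautifulSoup's 'xml' parser depends on it), roughly the same loop can be sketched with the standard library's xml.etree.ElementTree instead. This is a swapped-in alternative, not the BeautifulSoup approach above. The inline SAMPLE string is a trimmed copy of the test data from the question, and parse_allergies is a hypothetical helper name used only for illustration; it assumes real files follow the same layout.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's test data (assumption: real files share this layout).
SAMPLE = """<AllergyList>
  <Allergy>
    <Allergen><Name>Flagyl (metronidazole)</Name></Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
  </Allergy>
  <Allergy>
    <Allergen><Name>Bactrim (sulfamethoxazole-trimethoprim)</Name></Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
  <Number>2</Number>
</AllergyList>"""


def parse_allergies(xml_text):
    """Return one dict per <Allergy> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    allergies = []
    for allergy in root.iter('Allergy'):
        # findtext returns None when the child element is missing.
        allergies.append({
            'name': allergy.findtext('Allergen/Name'),
            'reaction': allergy.findtext('Reaction'),
            'on_set_date': allergy.findtext('OnsetDate'),
        })
    return allergies


allergies = parse_allergies(SAMPLE)
for entry in allergies:
    print(entry['name'], entry['reaction'], entry['on_set_date'], sep=' | ')
```

Each dictionary is built from the current allergy element inside the loop rather than from the whole document, which is what keeps every iteration from returning the first record again.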