My webpage looks like this:
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">GENDER:</strong> FEMALE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
<strong class="offender">WEIGHT:</strong> 118<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
<strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>
I want to extract the info for each individual and get YOB:1987, RACE:WHITE, etc.
What I tried is:
subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')
But this gives me only the labels YOB:, RACE:, etc., without their values.
Is there a way that I can get the data in YOB:1987, RACE:WHITE format?
Answer #1: You can use extract() to remove unwanted tags before you get the text. This keeps all the '\n' characters and spaces, so you will need some extra work to remove them. Alternatively, you can skip every Tag object inside the enclosing tag and keep only the NavigableString objects (the plain text in the HTML).
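Both ideas from this answer can be sketched as follows. This is only a minimal illustration, assuming a shortened two-field copy of the question's markup and the built-in "html.parser" backend; the names values and leftover are made up for the example.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the question's markup (two fields only, for brevity)
html = """<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
</p>"""

# Option 1: keep only the direct children that are plain text nodes,
# dropping the whitespace-only ones.
p = BeautifulSoup(html, "html.parser").find("p")
values = [
    child.strip()
    for child in p.children
    if isinstance(child, NavigableString) and child.strip()
]
print(values)    # ['1987', 'WHITE']

# Option 2: extract() the unwanted <strong> tags, then clean up the
# newlines and spaces that get_text() leaves behind.
p = BeautifulSoup(html, "html.parser").find("p")
for tag in p.find_all("strong"):
    tag.extract()
leftover = [line.strip() for line in p.get_text().splitlines() if line.strip()]
print(leftover)  # ['1987', 'WHITE']
```

Note that both options return only the values; the labels inside the <strong> tags are discarded, so they would have to be collected separately.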
BeautifulSoup is a Python package that can parse broken HTML, much as lxml does on top of the libxml2 parser.
Just loop through all the <strong> tags and use next_sibling to get what you want. Like this:
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)
Demo:
from bs4 import BeautifulSoup

html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# The value is the text node that immediately follows each <strong> label
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)
This gives you:
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
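Since next_sibling keeps the whitespace that surrounds each value, getting the exact YOB:1987 format the question asks for takes one more step. A minimal sketch, assuming a three-field excerpt of the same markup; the helper name label_value_pairs is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Three-field excerpt of the question's markup
html = """<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
</p>"""


def label_value_pairs(soup):
    """Return strings such as 'YOB:1987', one per <strong> tag (hypothetical helper)."""
    pairs = []
    for strong_tag in soup.find_all("strong"):
        sibling = strong_tag.next_sibling
        # next_sibling can be None or another tag; only text nodes are str instances
        value = sibling.strip() if isinstance(sibling, str) else ""
        pairs.append(strong_tag.get_text(strip=True) + value)
    return pairs


soup = BeautifulSoup(html, "html.parser")
print(label_value_pairs(soup))  # ['YOB:1987', 'RACE:WHITE', "HEIGHT:5'05''"]
```

The isinstance check is there because a label with no text after it would otherwise raise an error or pick up the following <br/> tag instead of a value.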
I think you can get it using subc1.text.
>>> html = """
... <p>
... <strong class="offender">YOB:</strong> 1987<br />
... <strong class="offender">RACE:</strong> WHITE<br />
... <strong class="offender">GENDER:</strong> FEMALE<br />
... <strong class="offender">HEIGHT:</strong> 5'05''<br />
... <strong class="offender">WEIGHT:</strong> 118<br />
... <strong class="offender">EYE COLOR:</strong> GREEN<br />
... <strong class="offender">HAIR COLOR:</strong> BROWN<br />
... </p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.text.strip())  # strip() drops the blank lines around the block
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
Or if you want to explore it, you can use .contents:
>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
['\n',
 <strong class="offender">YOB:</strong>,
 ' 1987',
 <br/>,
 '\n',
 <strong class="offender">RACE:</strong>,
 ' WHITE',
 <br/>,
 '\n',
 <strong class="offender">GENDER:</strong>,
 ' FEMALE',
 <br/>,
 '\n',
 <strong class="offender">HEIGHT:</strong>,
 " 5'05''",
 <br/>,
 '\n',
 <strong class="offender">WEIGHT:</strong>,
 ' 118',
 <br/>,
 '\n',
 <strong class="offender">EYE COLOR:</strong>,
 ' GREEN',
 <br/>,
 '\n',
 <strong class="offender">HAIR COLOR:</strong>,
 ' BROWN',
 <br/>,
 '\n']
and filter out the necessary items from the list:
>>> # Labels sit at indexes 1, 5, 9, ... and their values at 2, 6, 10, ...
>>> data = dict(zip([x.text for x in p.contents[1::4]],
...                 [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{'EYE COLOR:': 'GREEN',
 'GENDER:': 'FEMALE',
 'HAIR COLOR:': 'BROWN',
 'HEIGHT:': "5'05''",
 'RACE:': 'WHITE',
 'WEIGHT:': '118',
 'YOB:': '1987'}
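The slicing above depends on every <p> having exactly the same child layout. Because the question asks for the info of each individual, a less position-dependent sketch could build one dictionary per <p> instead. It assumes each person lives in their own <p> and that labels carry class="offender" as in the question; the function name parse_offenders and the second record in the sample are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two individuals; the second record is made-up sample data
html = """
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
</p>
<p>
<strong class="offender">YOB:</strong> 1990<br/>
<strong class="offender">EYE COLOR:</strong> BLUE<br/>
</p>
"""


def parse_offenders(markup):
    """Return one {label: value} dict per <p> that contains offender labels."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for p in soup.find_all("p"):
        record = {}
        for strong in p.find_all("strong", class_="offender"):
            sibling = strong.next_sibling
            # Only a text node counts as a value; None or a tag yields ""
            value = sibling.strip() if isinstance(sibling, str) else ""
            # Drop the trailing colon so the keys read 'YOB', 'EYE COLOR', ...
            record[strong.get_text(strip=True).rstrip(":")] = value
        if record:
            records.append(record)
    return records


for record in parse_offenders(html):
    print(record)
```

Paragraphs without any matching <strong> tag are skipped, so unrelated <p> elements on the page do not produce empty records.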