
Using BeautifulSoup to extract text without tags

My webpage looks like this:

<p>
  <strong class="offender">YOB:</strong> 1987<br/>
  <strong class="offender">RACE:</strong> WHITE<br/>
  <strong class="offender">GENDER:</strong> FEMALE<br/>
  <strong class="offender">HEIGHT:</strong> 5'05''<br/>
  <strong class="offender">WEIGHT:</strong> 118<br/>
  <strong class="offender">EYE COLOR:</strong> GREEN<br/>
  <strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>

I want to extract the info for each individual and get YOB:1987, RACE:WHITE, etc...

What I tried is:

subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')

But this gives me only the labels (YOB:, RACE:, etc.), not the values that follow them.

Is there a way that I can get the data in YOB:1987, RACE:WHITE format?

asked Apr 30 '14 05:04 by myloginid



2 Answers

Just loop through all the <strong> tags and use next_sibling to get what you want. Like this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

Demo:

from bs4 import BeautifulSoup

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">GENDER:</strong> FEMALE<br />
  <strong class="offender">HEIGHT:</strong> 5'05''<br />
  <strong class="offender">WEIGHT:</strong> 118<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
  <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

# Name the parser explicitly so the result does not depend on which
# parsers happen to be installed.
soup = BeautifulSoup(html, 'html.parser')

# Each value is the text node that directly follows its <strong> label.
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

This gives you:

YOB:  1987
RACE:  WHITE
GENDER:  FEMALE
HEIGHT:  5'05''
WEIGHT:  118
EYE COLOR:  GREEN
HAIR COLOR:  BROWN
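
To get the exact YOB:1987, RACE:WHITE format from the question, the sibling text can be stripped before joining. This is only a minimal sketch: it assumes each value sits in the text node directly after its <strong> label, the sample HTML is shortened, and the helper name format_fields is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">GENDER:</strong> FEMALE<br />
</p>
'''


def format_fields(p_tag):
    """Return 'LABEL:value' pairs joined by ', ' (hypothetical helper)."""
    pairs = []
    for strong_tag in p_tag.find_all('strong'):
        value = strong_tag.next_sibling
        # Assumption: the value is the plain-text node right after the
        # label. Labels followed by another tag (or nothing) are skipped.
        if not isinstance(value, NavigableString):
            continue
        pairs.append(strong_tag.get_text(strip=True) + value.strip())
    return ', '.join(pairs)


soup = BeautifulSoup(html, 'html.parser')
result = format_fields(soup.find('p'))
print(result)  # YOB:1987, RACE:WHITE, GENDER:FEMALE
```

Calling format_fields on each element of soup.find_all('p') would give one such string per individual.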
answered Sep 24 '22 18:09 by shaktimaan


I think you can get it using subc1.text.

>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN

Or, if you want to explore the structure, you can use .contents:

>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
[u'\n',
 <strong class="offender">YOB:</strong>,
 u' 1987',
 <br/>,
 u'\n',
 <strong class="offender">RACE:</strong>,
 u' WHITE',
 <br/>,
 u'\n',
 <strong class="offender">GENDER:</strong>,
 u' FEMALE',
 <br/>,
 u'\n',
 <strong class="offender">HEIGHT:</strong>,
 u" 5'05''",
 <br/>,
 u'\n',
 <strong class="offender">WEIGHT:</strong>,
 u' 118',
 <br/>,
 u'\n',
 <strong class="offender">EYE COLOR:</strong>,
 u' GREEN',
 <br/>,
 u'\n',
 <strong class="offender">HAIR COLOR:</strong>,
 u' BROWN',
 <br/>,
 u'\n']

and filter out the necessary items from the list:

>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{u'EYE COLOR:': u'GREEN',
 u'GENDER:': u'FEMALE',
 u'HAIR COLOR:': u'BROWN',
 u'HEIGHT:': u"5'05''",
 u'RACE:': u'WHITE',
 u'WEIGHT:': u'118',
 u'YOB:': u'1987'}
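
The [1::4] and [2::4] slices rely on every field occupying exactly four child nodes. If that layout might vary, pairing up stripped_strings is one possible alternative. The following is a sketch under the assumption that labels and values strictly alternate inside each <p>; the second record and the helper name parse_person are made up for illustration, and the trailing colon is dropped from the keys as a design choice.

```python
from bs4 import BeautifulSoup

# The first <p> is taken from the question; the second is invented sample data.
html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
</p>
<p>
<strong class="offender">YOB:</strong> 1990<br />
<strong class="offender">EYE COLOR:</strong> BLUE<br />
</p>
"""


def parse_person(p_tag):
    """Build a {label: value} dict from one <p> block (hypothetical helper)."""
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips the whitespace-only nodes between fields.
    strings = iter(p_tag.stripped_strings)
    # Zipping an iterator with itself consumes two items per step,
    # giving (label, value) pairs.
    return {label.rstrip(':'): value for label, value in zip(strings, strings)}


soup = BeautifulSoup(html, 'html.parser')
people = [parse_person(p) for p in soup.find_all('p')]
print(people)
```

This yields one dict per individual, so a page with several <p> blocks produces a list of records rather than a single merged dict.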
answered Sep 22 '22 18:09 by Sufian Latif