Using BeautifulSoup to extract text without tags

<p>
  <strong class="offender">YOB:</strong> 1987<br/>
  <strong class="offender">RACE:</strong> WHITE<br/>
  <strong class="offender">GENDER:</strong> FEMALE<br/>
  <strong class="offender">HEIGHT:</strong> 5'05''<br/>
  <strong class="offender">WEIGHT:</strong> 118<br/>
  <strong class="offender">EYE COLOR:</strong> GREEN<br/>
  <strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>

I want to extract the info for each individual and get YOB:1987, RACE:WHITE, etc...

What I tried is:

subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')

But this gives me only the values of YOB:, RACE:, etc...

Is there a way that I can get the data in YOB:1987, RACE:WHITE format?


asked Apr 30 '14 05:04

myloginid

2 Answers

Just loop through all the <strong> tags and use next_sibling to get the text that follows each one. Like this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

Demo:

from bs4 import BeautifulSoup

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">GENDER:</strong> FEMALE<br />
  <strong class="offender">HEIGHT:</strong> 5'05''<br />
  <strong class="offender">WEIGHT:</strong> 118<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
  <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')

# Print each label followed by the text node that comes right after it.
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

This gives you:

YOB:  1987
RACE:  WHITE
GENDER:  FEMALE
HEIGHT:  5'05''
WEIGHT:  118
EYE COLOR:  GREEN
HAIR COLOR:  BROWN
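The output above still has whitespace between the label and the value. If you want exactly the YOB:1987 form from the question, one option is to strip the sibling text before joining. This is only a sketch: the helper name label_value_pairs is made up for illustration, the sample HTML is shortened, and it assumes the value is always the text node directly after each <strong> tag.

```python
from bs4 import BeautifulSoup

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">HEIGHT:</strong> 5'05''<br />
</p>
'''

soup = BeautifulSoup(html, 'html.parser')


def label_value_pairs(soup):
    """Hypothetical helper: return strings such as 'YOB:1987'."""
    pairs = []
    for strong_tag in soup.find_all('strong'):
        label = strong_tag.get_text(strip=True)
        sibling = strong_tag.next_sibling
        # Assumption: the value is a plain text node right after the tag.
        # If the sibling is missing or is another tag, use an empty value.
        value = sibling.strip() if isinstance(sibling, str) else ''
        pairs.append(label + value)
    return pairs


pairs = label_value_pairs(soup)
print(pairs)
```

The isinstance check is there because next_sibling can be None or another tag when a label has no text after it.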

answered Sep 24 '22 18:09

shaktimaan

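I think you can get it using subc1.text, which returns the text inside the tag with the markup stripped. The session below calls .text on the whole soup: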

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print(soup.text)

YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
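A related option, if you want the pieces of text individually rather than as one block, is the stripped_strings generator, which yields each text node with surrounding whitespace removed and skips whitespace-only nodes. The sketch below uses a shortened sample and assumes labels and values strictly alternate, so it would misalign if any label had an empty value.

```python
from bs4 import BeautifulSoup

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
  <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

# Each text node, stripped; whitespace-only nodes are skipped.
strings = list(p.stripped_strings)

# Assumption: even positions are labels, odd positions are their values.
combined = [label + value
            for label, value in zip(strings[0::2], strings[1::2])]
print(combined)
```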

Or, if you want to explore the structure, you can use .contents:

>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
[u'\n',
 <strong class="offender">YOB:</strong>,
 u' 1987',
 <br/>,
 u'\n',
 <strong class="offender">RACE:</strong>,
 u' WHITE',
 <br/>,
 u'\n',
 <strong class="offender">GENDER:</strong>,
 u' FEMALE',
 <br/>,
 u'\n',
 <strong class="offender">HEIGHT:</strong>,
 u" 5'05''",
 <br/>,
 u'\n',
 <strong class="offender">WEIGHT:</strong>,
 u' 118',
 <br/>,
 u'\n',
 <strong class="offender">EYE COLOR:</strong>,
 u' GREEN',
 <br/>,
 u'\n',
 <strong class="offender">HAIR COLOR:</strong>,
 u' BROWN',
 <br/>,
 u'\n']

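Then pick the necessary items out of the list. Here the labels sit at every fourth position starting from index 1, and the values at every fourth position starting from index 2: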

>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{u'EYE COLOR:': u'GREEN',
 u'GENDER:': u'FEMALE',
 u'HAIR COLOR:': u'BROWN',
 u'HEIGHT:': u"5'05''",
 u'RACE:': u'WHITE',
 u'WEIGHT:': u'118',
 u'YOB:': u'1987'}
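The slicing above depends on the exact layout of the paragraph's child nodes, so it may break if the whitespace or the <br/> tags change. Since the question asks for the info "for each individual", here is a hedged sketch that builds one dict per <p> without relying on positions. The second record is invented sample data, parse_record is a hypothetical helper name, and the code assumes each page has one <p> per individual.

```python
from bs4 import BeautifulSoup

# Two made-up records in the same shape as the question's HTML.
html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
</p>
<p>
  <strong class="offender">YOB:</strong> 1975<br />
  <strong class="offender">EYE COLOR:</strong> BLUE<br />
</p>
'''

soup = BeautifulSoup(html, 'html.parser')


def parse_record(p_tag):
    """Hypothetical helper: map each label (without the colon) to its value."""
    record = {}
    for strong_tag in p_tag.find_all('strong', class_='offender'):
        key = strong_tag.get_text(strip=True).rstrip(':')
        sibling = strong_tag.next_sibling
        # Assumption: the value is the text node right after the label.
        record[key] = sibling.strip() if isinstance(sibling, str) else ''
    return record


records = [parse_record(p) for p in soup.find_all('p')]
print(records)
```

Dropping the trailing colon from the keys is a design choice for easier lookups such as records[0]['YOB']; keep it if you need the labels exactly as they appear on the page.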

answered Sep 22 '22 18:09

Sufian Latif
