
BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

I'm trying to scrape all the inner html from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text.

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

How can I extract:

Red
Blue
Yellow
Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.

Is there a 'just get the visible HTML' type of method in BeautifulSoup?

----UPDATE------

On advice, trying:

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help - it prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8
AP257 asked Jun 02 '10 10:06



3 Answers

Short answer: soup.findAll(text=True)

This has already been answered elsewhere on Stack Overflow and in the BeautifulSoup documentation.

UPDATE:

To clarify, a working piece of code:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
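
For readers on Beautiful Soup 4, here is a minimal sketch of why .string and .contents[0] fall short on the last paragraph of the question's sample, and why joining every text node works. It assumes the bs4 package is installed and uses the built-in "html.parser"; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, collapsed onto one line.
html = "<p>Red</p><p><i>Blue</i></p><p>Yellow</p><p>Light <b>green</b></p>"
soup = BeautifulSoup(html, "html.parser")

# The last paragraph mixes plain text with a nested <b> tag.
last_p = soup.find_all("p")[-1]

# .string is None because the tag has more than one child.
string_result = last_p.string

# .contents[0] is only the first child, so the nested text is lost.
first_child = last_p.contents[0]

# Collecting every descendant text node and joining them keeps everything.
joined = "".join(last_p.find_all(string=True))

print(repr(string_result))
print(repr(first_child))
print(repr(joined))
```

The string=True filter is the bs4 spelling of the text=True argument used above; it matches text nodes at any nesting depth, so no inner tag names need to be known in advance.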
taleinat answered Oct 26 '22 05:10


The accepted answer is great but it is 6 years old now, so here's the current Beautiful Soup 4 version of this answer:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
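
One caveat: soup.strings walks the whole document, so on a real page it also returns text outside the <p> elements. The sketch below restricts the same idea to each paragraph. It assumes bs4 with the built-in "html.parser"; the <h1> heading is added here purely for illustration and is not part of the original sample.

```python
from bs4 import BeautifulSoup

# The <h1> is a hypothetical extra element to show the difference.
html = """<h1>Colours</h1>
<p>Red</p>
<p>Light <b>green</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Every text node in the document, heading and newlines included.
whole_document = "".join(soup.strings)

# Only the text inside each <p>, one entry per paragraph.
per_paragraph = ["".join(p.strings) for p in soup.find_all("p")]

print(repr(whole_document))
print(per_paragraph)
```

Calling .strings on an individual tag yields only that tag's descendant text nodes, which is what keeps the heading out of per_paragraph.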
Jaymon answered Oct 26 '22 07:10


I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.

# import the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# set up the BeautifulSoup object (the URL is a placeholder;
# features="lxml" needs the lxml package, "html.parser" is built in)
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")
p_tags = soup.find_all('p')

# print only the text of each <p>, ignoring any nested tags
for each in p_tags:
    print(each.get_text())

Notice that we loop over the list of <p> tags one by one and call get_text() on each. get_text() strips the tags, including any nested ones, so only the text is printed.

Also:

  • it is better to use the updated 'find_all()' in bs4 than the older findAll()
  • in Python 3, urllib2 was split into urllib.request and urllib.error

For a page containing the sample HTML from the question, your output should be:

  • Red
  • Blue
  • Yellow
  • Light green

Hope this helps someone looking for an updated solution.
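
If the page has stray whitespace inside its paragraphs, get_text() also accepts separator and strip arguments. The sketch below runs offline on an inline string instead of a fetched page and uses the built-in "html.parser"; the extra spaces and newline in the sample are made up for illustration.

```python
from bs4 import BeautifulSoup

# A paragraph with leading spaces and a trailing newline (hypothetical input).
html = "<p>  Light <b>green</b>\n</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Default: text nodes are concatenated exactly as they appear.
raw = p.get_text()

# strip=True trims each text node before joining, which glues words together.
squashed = p.get_text(strip=True)

# Adding a separator puts the space back between the trimmed pieces.
spaced = p.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(squashed))
print(repr(spaced))
```

Using strip=True on its own is a common trap when nested tags sit next to plain text, so pairing it with a separator is usually the safer choice.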

erdin answered Oct 26 '22 06:10