I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care about them; I just want the internal text.
For example, for:
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
How can I extract:
Red
Blue
Yellow
Light green
Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance; I want to handle any that may occur.
Is there a 'just get the visible HTML' type of method in BeautifulSoup?
----UPDATE------
On advice, trying:
soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags):
    print str(i) + p_tag
But that doesn't help - it prints out:
0Red
1
2Blue
3
4Yellow
5
6Light
7green
8
Step-by-step approach
Step 1: Import the BeautifulSoup module for scraping, and the requests module for fetching the web page.
Step 2: Request the URL by calling the get method.
find() method: The find method returns the first tag with the specified name or id as a bs4 object (a Tag). It can be tried on any simple HTML page that has several paragraph tags, such as the one in the question.
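To make the find() description concrete, here is a minimal sketch that runs against the question's sample HTML rather than a live URL, so the requests step is left out. The variable names (html, first_p, all_p) are chosen for illustration only, and "html.parser" is assumed as the parser because it ships with Python.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question (stands in for a downloaded page).
html = """
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # first matching Tag, or None if nothing matches
all_p = soup.find_all("p")    # list-like ResultSet with every matching Tag

print(type(first_p).__name__)  # Tag
print(first_p.get_text())      # Red
print(len(all_p))              # 4
print(soup.find("table"))      # None, because there is no <table> in the sample
```

Because find() returns None when nothing matches, it is worth checking the result before calling methods on it.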
Short answer: soup.findAll(text=True)
This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.
UPDATE:
To clarify, a working piece of code:
>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))
...
Red
Blue
Yellow
Light green
The accepted answer is great but it is 6 years old now, so here's the current Beautiful Soup 4 version of this answer:
>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))
Red
Blue
Yellow
Light green
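Note that soup.strings walks the whole document, not only the <p> elements. If only the paragraphs matter, a per-tag variant along these lines may be closer to what the question asks for. This is a sketch under the same assumptions as the session above (Beautiful Soup 4 with "html.parser"); the names texts and last are illustrative.

```python
from bs4 import BeautifulSoup

html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every string inside the tag, however deeply nested.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)  # ['Red', 'Blue', 'Yellow', 'Light green']

# For comparison: .string gives None when a tag has more than one child,
# which is why it fails on the last paragraph in the question.
last = soup.find_all("p")[-1]
print(last.string)  # None
```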
I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.
# import the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# set up the BeautifulSoup object (features="lxml" requires the lxml package)
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")

# print the text of every <p> element, with inner tags stripped
p_tags = soup.find_all('p')
for each in p_tags:
    print(str(each.get_text()))
Notice that we loop over the <p> tags one by one and call the get_text() method on each; it strips the tags from the text, so that we only print out the text.
For the sample HTML in the question, the output should be:
Red
Blue
Yellow
Light green
Hope this helps someone looking for an updated solution.
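Real pages often have line breaks and indentation inside a paragraph, which get_text() returns verbatim. The sketch below shows two ways this might be tidied up; the sample markup and the names raw, clean and collapsed are made up for illustration, and "html.parser" is assumed so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with the kind of whitespace found in real pages.
html = "<p>\n  Light\n  <b>green</b>\n</p>"

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

raw = p.get_text()                   # keeps newlines and indentation as-is
clean = p.get_text(" ", strip=True)  # strip each piece, join pieces with a space
collapsed = " ".join(raw.split())    # alternative: collapse all whitespace runs

print(repr(raw))    # '\n  Light\n  green\n'
print(clean)        # Light green
print(collapsed)    # Light green
```

One caveat with the separator form: it puts a space between every text fragment, so markup that splits a word (for example `Li<b>ght</b>`) would come out as two words. The whitespace-collapsing form does not have that problem.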