BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

Tags:

python

beautifulsoup

I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care about them; I just want to get the internal text.

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

How can I extract:

Red
Blue
Yellow
Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.

Is there a 'just get the visible HTML' type of method in BeautifulSoup?

----UPDATE------

On advice, trying:

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help - it prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8

asked Jun 02 '10 10:06

AP257

3 Answers

Short answer: soup.findAll(text=True)

This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.

UPDATE:

To clarify, a working piece of code:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green

answered Oct 26 '22 05:10

taleinat

The accepted answer is great but it is 6 years old now, so here's the current Beautiful Soup 4 version of this answer:

>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
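One caveat worth sketching: joining soup.strings works above because the sample HTML has a newline after each paragraph. Assuming the same paragraphs arrive on a single line (the html string below is made up for illustration), joining the strings of each <p> separately keeps them apart:

```python
from bs4 import BeautifulSoup

# Same paragraphs as above, but with no newlines between them (illustrative input).
html = "<p>Red</p><p><i>Blue</i></p><p>Yellow</p><p>Light <b>green</b></p>"
soup = BeautifulSoup(html, "html.parser")

# Joining every string in the document runs the paragraphs together.
all_text = "".join(soup.strings)

# Joining the strings of each <p> on its own gives one entry per paragraph.
per_paragraph = ["".join(p.strings) for p in soup.find_all("p")]

print(all_text)
print(per_paragraph)
```

The per-paragraph list is usually the easier shape to work with if each <p> should end up on its own line.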

answered Oct 26 '22 07:10

Jaymon

I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.

# import the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# download the page and build the BeautifulSoup object
# (replace the placeholder URL; the "lxml" parser requires the lxml package)
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")
p_tags = soup.find_all('p')

# print the text of each <p> tag, ignoring any nested tags
for each in p_tags:
    print(each.get_text())

Notice that we loop over the list of tags one by one and call get_text() on each of them. get_text() strips the tags and returns all the nested text, so only the text is printed.
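As a small sketch of how get_text() behaves on nested markup, the snippet below uses a made-up HTML string (not from the original page) and the built-in "html.parser" so that lxml is not needed. It also shows the optional separator and strip arguments, which can help when adjacent tags would otherwise run together:

```python
from bs4 import BeautifulSoup

# Illustrative input: two paragraphs inside a div, with stray whitespace.
html = "<div><p>  Red </p><p>Light <b>green</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# Default: every nested string is concatenated exactly as it appears.
plain = div.get_text()

# With a separator and strip=True, each string is stripped first
# and the pieces are joined with a single space.
spaced = div.get_text(" ", strip=True)

print(repr(plain))
print(repr(spaced))
```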

Also:

It is better to use the updated find_all() in bs4 than the older findAll().
In Python 3, urllib2 was replaced by urllib.request and urllib.error.
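To illustrate the urllib.request / urllib.error split, here is a hedged sketch that wraps the download in error handling. The function names paragraph_texts and fetch_paragraph_texts, and the 10-second timeout, are assumptions made for this example and are not part of the original answer:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the text of every <p> element in an HTML string or bytes."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


def fetch_paragraph_texts(url, timeout=10):
    """Download a page and return its paragraph texts, or [] on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            html = response.read()
    except HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        print(f"HTTP error {err.code} for {url}")
        return []
    except URLError as err:
        # The server could not be reached at all (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
        return []
    return paragraph_texts(html)
```

HTTPError is a subclass of URLError, so it has to be caught first. Keeping the parsing in its own function means it can be tried on a literal string without any network access, for example paragraph_texts("<p>Light <b>green</b></p>").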

Now your output should be:

Red
Blue
Yellow
Light green

Hope this helps someone looking for an updated solution.

answered Oct 26 '22 06:10

erdin
