Python BeautifulSoup get text from HTML

I have some HTML code like this:

<p>aaa</p>bbb
<p>ccc</p>ddd

How can I get 'bbb' and 'ddd'?

se77en asked May 08 '26 19:05


1 Answer

You can read the next sibling of each p tag (note that this relies on the text immediately following each p tag, as it does in your sample, so you may need to adapt it to your real markup):

In [1]: from bs4 import BeautifulSoup

In [2]: html = """\
   ...: <p>aaa</p>bbb
   ...: <p>ccc</p>ddd"""

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: [p.next_sibling for p in soup.findAll('p')]
Out[4]: [u'bbb\n', u'ddd']

This picks up the trailing newline, so you can strip it off if need be:

In [5]: [p.next_sibling.strip() for p in soup.findAll('p')]
Out[5]: [u'bbb', u'ddd']

The general idea is to locate the tag(s) that come right before your target text and then take each tag's next sibling, which in this markup is the text node you want rather than another element.
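If the markup is less regular than the sample, the next sibling may be another tag or may not exist at all, and calling .strip() on it would fail. Here is a minimal sketch of a more defensive version of the same idea. The helper name text_after_tags is hypothetical (it is not part of BeautifulSoup), and the built-in 'html.parser' is an assumption; other parsers may build a slightly different tree.

```python
from bs4 import BeautifulSoup, NavigableString

html = """\
<p>aaa</p>bbb
<p>ccc</p>ddd"""


def text_after_tags(markup, tag_name):
    """Collect the text that directly follows each <tag_name> element.

    Hypothetical helper for illustration; it is not a BeautifulSoup API.
    """
    # Assumption: the stdlib 'html.parser' backend is good enough here.
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for tag in soup.find_all(tag_name):
        sibling = tag.next_sibling
        # next_sibling can be None (last node) or another Tag,
        # so only keep it when it is a text node.
        if isinstance(sibling, NavigableString):
            text = sibling.strip()
            if text:  # skip whitespace-only text nodes
                results.append(text)
    return results


print(text_after_tags(html, "p"))
```

For the sample HTML this prints ['bbb', 'ddd']. A p tag followed directly by another tag, or by nothing, simply contributes no entry.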

RocketDonkey answered May 10 '26 08:05


