Python BeautifulSoup get text from HTML

Question

I have some HTML code like this:

<p>aaa</p>bbb
<p>ccc</p>ddd

How can I get 'bbb' and 'ddd'?

RocketDonkey · Accepted Answer

You can read the next sibling of each p tag. Note that this is specific to the HTML shown, so you may need to adapt it to your actual markup:

In [1]: from bs4 import BeautifulSoup

In [2]: html = """\
   ...: <p>aaa</p>bbb
   ...: <p>ccc</p>ddd"""

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: [p.next_sibling for p in soup.find_all('p')]
Out[4]: [u'bbb\n', u'ddd']

This picks up the trailing newline, so you can strip it off if need be:

In [5]: [p.next_sibling.strip() for p in soup.find_all('p')]
Out[5]: [u'bbb', u'ddd']

The general idea is to locate the tag(s) immediately before your target text and then take each tag's next sibling, which in this HTML is the text node you want.
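In less tidy HTML, the next sibling of a tag may be missing or may be another tag rather than text. The following is a minimal sketch of the same idea with that guard added. The helper name `text_after_tags` and the extra `<p>eee</p><p>fff</p>` in the sample input are made up for illustration, and it assumes the built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup, NavigableString


def text_after_tags(html, tag_name="p"):
    """Return the stripped text that directly follows each <tag_name> tag.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all(tag_name):
        sibling = tag.next_sibling
        # next_sibling is None when nothing follows the tag, and a Tag
        # when another element follows it, so keep only text nodes.
        if isinstance(sibling, NavigableString):
            text = sibling.strip()
            if text:  # skip whitespace-only text between tags
                results.append(text)
    return results


# Sample input: the question's HTML plus two adjacent <p> tags
# that have no text after them.
html = "<p>aaa</p>bbb\n<p>ccc</p>ddd<p>eee</p><p>fff</p>"
print(text_after_tags(html))
```

Here the two trailing p tags contribute nothing, because one is followed by another tag and the other by nothing at all. Without the `isinstance` check, calling `.strip()` on those siblings would fail.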

Tags: python, html, beautifulsoup

se77en
