Suppose I have an HTML string like this:
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
I want to extract the content of d2 that is NOT wrapped by other tags, skipping "a url". In other words, I want to get this result:
Text 2
Text 2 continue
Is there a way to do it with BeautifulSoup?
I tried this, but it is not correct:
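from bs4 import BeautifulSoup
# html_doc holds the HTML string shown above
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find(id='d2').text
print(s)  # also includes "a url", the text inside the <a> tag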
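Try .find_all(text=True, recursive=False). With recursive=False the search is limited to the direct children of the tag, so text inside nested tags such as <a> is skipped. (In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.)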
from bs4 import BeautifulSoup
div_test="""
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
"""
soup = BeautifulSoup(div_test, 'lxml')
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s])  # strip surrounding whitespace
It returns a list containing only the text nodes:
[u'\n Text 2\n ', u'\n Text 2 continue\n ']
[u'Text 2', u'Text 2 continue']
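The same idea can be wrapped in a small helper. This is only a sketch: the names direct_text and HTML_DOC are made up for illustration, it uses the newer string= spelling of the argument (assuming Beautiful Soup 4.4.0 or later), and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Same markup as in the question (HTML_DOC is a hypothetical name).
HTML_DOC = """
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
"""


def direct_text(html, element_id):
    """Return the stripped text nodes that are direct children of the element."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id=element_id)
    # recursive=False restricts the search to direct children,
    # so the text inside <a> is never visited.
    pieces = tag.find_all(string=True, recursive=False)
    # Drop pieces that are empty after stripping whitespace.
    return [p.strip() for p in pieces if p.strip()]


result = direct_text(HTML_DOC, "d2")
print(result)
```

Filtering out empty pieces is a design choice for this sketch; it keeps whitespace-only text nodes between tags from showing up as empty strings.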
You can get only the NavigableString objects with a simple comprehension over the tag's children.
import bs4  # needed for bs4.element.NavigableString
tag = soup.find(id='d2')
s = ''.join(e for e in tag if type(e) is bs4.element.NavigableString)
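The joined string still contains the newlines from the original markup. A minimal self-contained sketch of this approach, with an extra clean-up step that is not part of the answer above (the variable names joined and lines are just for illustration):

```python
from bs4 import BeautifulSoup, NavigableString

html = (
    '<div id="d2">\n'
    "Text 2\n"
    '<a href="http://my.url/">a url</a>\n'
    "Text 2 continue\n"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find(id="d2")

# Iterating over a Tag yields its direct children (strings and tags).
# The exact type check keeps plain text and skips subclasses such as comments.
joined = "".join(e for e in tag if type(e) is NavigableString)

# Split on newlines and drop blank lines to get clean pieces.
lines = [line.strip() for line in joined.splitlines() if line.strip()]
print(lines)
```

Using type(e) is NavigableString rather than isinstance is deliberate here: isinstance would also accept Comment and other NavigableString subclasses.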
Alternatively, you can use the decompose method to delete all the child tags, then read what remains with text. Note that this modifies the parsed tree.
tag = soup.find(id='d2')
for e in tag.find_all():
    e.decompose()  # removes the tag and its contents from the tree
s = tag.text
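Because decompose changes the tree in place, the <a> tag is gone from soup afterwards. If the original document is still needed, one option is to work on a copy. The sketch below assumes a Beautiful Soup version in which copy.copy on a Tag produces a detached copy (4.4.0 or later); clone and text_lines are illustrative names.

```python
import copy

from bs4 import BeautifulSoup

html = (
    '<div id="d2">\n'
    "Text 2\n"
    '<a href="http://my.url/">a url</a>\n'
    "Text 2 continue\n"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find(id="d2")

# Work on a detached copy so the original tree keeps its <a> tag.
clone = copy.copy(tag)
for child in clone.find_all():
    child.decompose()

# Only the direct text of the div is left in the copy.
text_lines = [line.strip() for line in clone.get_text().splitlines() if line.strip()]
print(text_lines)

# The original tag is untouched.
print(tag.find("a") is not None)
```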