Given the HTML below, I want to output just the text of the h1, without the "Details about " prefix, which is the text of the span nested inside the h1.
My current output gives:
Details about New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
I would like:
New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
Here is the HTML I am working with:
<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>
Here is my current code:
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())
Note: I do not want to just truncate the string, because I would like this code to be reusable. Ideally, the code would drop any text that is enclosed by the span.
BeautifulSoup is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection.
You can use extract() to remove all span tags:
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every <span> inside the heading so its text is left out
    for s in line('span'):
        s.extract()
    print(line.get_text())
    # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
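Since the question asks for something reusable, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: Beautiful Soup 4 is installed, the built-in "html.parser" backend is used, and the names `text_without` and `HTML` are made up for illustration.

```python
from bs4 import BeautifulSoup

# The heading from the question, inlined so the example is self-contained.
HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about </span>'
    "New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)


def text_without(tag, unwanted="span"):
    """Return the text of `tag`, skipping text inside `unwanted` descendants.

    Hypothetical helper: it detaches the unwanted tags with extract(),
    so the parsed tree is modified as a side effect.
    """
    for child in tag.find_all(unwanted):
        child.extract()
    return tag.get_text().strip()


soup = BeautifulSoup(HTML, "html.parser")
titles = [
    text_without(h1)
    for h1 in soup.find_all("h1", attrs={"itemprop": "name"})
]
print(titles[0])
```

Note that extract() changes the soup itself, so if the span is needed later, parse the HTML again or read the span before calling the helper.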