
BeautifulSoup .get_text() is not specific enough for my HTML parsing

Given the HTML below, I want to output just the text of the h1, without the "Details about  " prefix, which is the text of the span nested inside the h1.

My current output gives:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

for line in soup.find_all('h1',attrs={'itemprop':'name'}):
    print(line.get_text())

Note: I do not want to simply truncate the string, because I would like this code to be reusable. Ideally, the code would strip out any text enclosed in the span.

Rorschach asked Jul 16 '15 18:07



1 Answer

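You can use extract() to remove every span tag from the h1 before reading its text. Note that this modifies the parsed tree in place:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # line('span') is shorthand for line.find_all('span')
    for s in line('span'):
        s.extract()  # detach the span (and its text) from the tree
    # print inside the loop so every matching h1 is reported
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black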
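
If changing the parsed tree is undesirable, one alternative is to collect only the text nodes that are not nested inside a span. The following is a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) and Python's built-in html.parser; the helper name text_without is made up for illustration and is not part of the library.

```python
from bs4 import BeautifulSoup

# The HTML from the question, split across lines for readability.
html = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  &nbsp;</span>'
    "New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)


def text_without(tag, skip="span"):
    """Return tag's text, leaving out text nested inside a `skip` element.

    Hypothetical helper: it only reads the tree and never modifies it.
    """
    def inside_skipped(node):
        # Walk up from the text node until we reach `tag` itself.
        for parent in node.parents:
            if parent is tag:
                return False
            if parent.name == skip:
                return True
        return False

    parts = [s for s in tag.find_all(string=True) if not inside_skipped(s)]
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
titles = [
    text_without(h1)
    for h1 in soup.find_all("h1", attrs={"itemprop": "name"})
]
for title in titles:
    print(title)
```

Because nothing is extracted, the span is still available afterwards if other code needs it. Passing a different tag name as skip would exclude that element's text instead.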
Wiktor Stribiżew answered Oct 23 '22 05:10