I am trying to retrieve a few <p> tags from the following HTML. Here is only the relevant part of it:
<td class="eelantext">
<a class="fBlackLink"></a>
<center></center>
<span> … </span><br></br>
<table width="402" vspace="5" cellspacing="0" cellpadding="3"
border="0" bgcolor="#ffffff" align="Left">
<tbody> … </tbody></table>
<!--edstart-->
<p> … </p>
<p> … </p>
<p> … </p>
<p> … </p>
<p> … </p>
</td>
You can find the webpage here
My Python code is the following
from bs4 import BeautifulSoup

soup = BeautifulSoup(page)
div = soup.find('td', attrs={'class': 'eelantext'})
print(div)
text = div.find_all('p')  # comes back empty
But the text variable is empty, and when I print the div variable I get exactly the same HTML as above, except that the <p> tags are missing.
BeautifulSoup can use different parsers to handle HTML input. The HTML input here is a little broken, and the default HTMLParser parser doesn't handle it very well.
Use the html5lib parser instead:
>>> len(BeautifulSoup(r.text, 'html').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'lxml').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'html5lib').find('td', attrs={'class': 'eelantext'}).find_all('p'))
22
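To check which parser copes with a given page, the comparison above can be wrapped in a small helper. This is only a sketch: count_paragraphs and the sample string are hypothetical names made up for illustration, and it assumes bs4 is installed, with lxml and html5lib optional. The sample markup is deliberately well-formed, so it shows only the mechanics and does not reproduce the broken page from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_paragraphs(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <p> tags inside td.eelantext}.

    A parser whose library is not installed maps to None.
    (Hypothetical helper, not part of BeautifulSoup.)
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            counts[name] = None  # this parser library is not installed
            continue
        cell = soup.find("td", attrs={"class": "eelantext"})
        # find() returns None when the cell itself was not found
        counts[name] = len(cell.find_all("p")) if cell is not None else 0
    return counts


# Illustrative, well-formed sample; replace it with the real page source.
sample = '<table><tr><td class="eelantext"><p>a</p><p>b</p></td></tr></table>'
print(count_paragraphs(sample))
```

Passing the parser name explicitly also keeps the result independent of which parser libraries happen to be installed, since BeautifulSoup otherwise picks one for you.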