I am trying to use BeautifulSoup to extract the contents from a website (http://brooklynexposed.com/events/). As an example of the problem I can run the following code:
import urllib
from bs4 import BeautifulSoup

url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()  # Python 2; in Python 3 use urllib.request.urlopen
soup = BeautifulSoup(html)  # no parser specified, so bs4 picks whichever parser it finds installed
print soup.prettify().encode('utf-8')
The output seems to cut off the html as follows:
<li class="event">
9:00pm - 11:00pm
<br/>
<a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
Comedy Sh
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</body>
</html>
It is cutting off the "Comedy Show" listing along with all of the HTML that follows it, right up to the final closing tags; the majority of the page is silently dropped. I have noticed the same thing on numerous websites: if the page is too long, BeautifulSoup fails to parse the entire page and simply truncates the output. Does anyone have a solution for this? If BeautifulSoup is not capable of handling such pages, does anyone know of another library with a function similar to prettify()?
Beautiful Soup provides find() and find_all() functions to pull specific data out of the HTML by passing a tag name: find() returns the first element matching the given tag, while find_all() returns a list of all matching elements.
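A minimal sketch of the two functions, using a made-up HTML snippet (not the actual brooklynexposed.com markup) so it runs standalone:

```python
from bs4 import BeautifulSoup

# Hypothetical event markup, loosely modeled on the page in the question
html = '<ul><li class="event">Show A</li><li class="event">Show B</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching element
first = soup.find('li')
print(first.get_text())  # Show A

# find_all() returns every matching element; keyword filters like
# class_ narrow the match further
for li in soup.find_all('li', class_='event'):
    print(li.get_text())
```

Note that class_ has a trailing underscore because class is a reserved word in Python.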
BeautifulSoup is a Python package that can parse broken HTML, just as lxml can via the libxml2 parser it is built on.
I had trouble with bs4 cutting off HTML on some machines but not others; it was not reproducible.
I switched to this:
soup = bs4.BeautifulSoup(html, 'html5lib')
.. and it works now.
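To illustrate why the parser choice matters, here is a sketch with a deliberately malformed snippet (an invented example, not the page from the question; html5lib must be installed separately, e.g. via pip install html5lib):

```python
from bs4 import BeautifulSoup

# Deliberately malformed HTML: neither <li> nor <a> is ever closed
broken = '<ul><li class="event">9:00pm<br><a href="/e/1">Comedy Show'

# html5lib rebuilds the tree the way a browser would, so the
# dangling tags are closed instead of the tail being dropped
soup = BeautifulSoup(broken, 'html5lib')
print(soup.find('a').get_text())  # Comedy Show
```

Passing the parser name explicitly also removes the machine-to-machine variation: without it, bs4 silently uses whichever parser happens to be installed, which is likely why the truncation was not reproducible.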