BeautifulSoup not extracting all html (automatically deleting much of a page's html)

I am trying to use BeautifulSoup to extract the contents from a website (http://brooklynexposed.com/events/). As an example of the problem I can run the following code:

import urllib.request
import bs4

url = 'http://brooklynexposed.com/events/'
# Download the raw page bytes
html = urllib.request.urlopen(url).read()

# No parser is named here, so bs4 uses whichever one it finds installed
soup = bs4.BeautifulSoup(html)
print(soup.prettify())

The output seems to cut off the html as follows:

       <li class="event">
        9:00pm - 11:00pm
        <br/>
        <a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
         Comedy Sh
        </a>
       </li>
      </ul>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>

It cuts off the listing named Comedy Show, along with all the HTML that comes after it, until the final closing tags. The majority of the HTML is being removed automatically. I have noticed similar behavior on numerous websites: if the page is too long, BeautifulSoup fails to parse the entire page and just cuts out text. Does anyone have a solution for this? If BeautifulSoup is not capable of handling such pages, does anyone know of other libraries with functions similar to prettify()?

asked Jul 15 '13 17:07 by user2540231


1 Answer

I had the problem that bs4 truncated the HTML on some machines but not on others. It was not reproducible.

I switched to this:

soup = bs4.BeautifulSoup(html, 'html5lib')

...and it works now.

answered Sep 19 '22 10:09 by guettli
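
A likely reason the truncation shows up on some machines and not others: when no parser is named, bs4 picks the best one it finds installed (lxml, then html5lib, then Python's built-in html.parser), and these parsers can build different trees from the same malformed page. The sketch below is one way to make that choice explicit instead of implicit. pick_parser and make_soup are hypothetical helper names, not part of bs4, and the preference order is an assumption based on the answer above. html5lib and lxml are separate packages that have to be installed.

```python
import importlib.util

# Parser names that bs4 accepts, in the order this sketch prefers them.
# html5lib and lxml are separate installs; html.parser ships with Python.
PREFERRED_PARSERS = ("html5lib", "lxml", "html.parser")


def pick_parser(candidates=PREFERRED_PARSERS):
    """Return the first parser name whose backing module can be found."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("no usable parser found among %r" % (candidates,))


def make_soup(html, parser=None):
    """Parse html with an explicitly named parser (hypothetical helper)."""
    import bs4  # imported lazily so pick_parser() works without bs4
    return bs4.BeautifulSoup(html, parser or pick_parser())


if __name__ == "__main__":
    # Shows which parser this machine would end up using.
    print("parser that would be used:", pick_parser())
```

Calling make_soup(html) in place of bs4.BeautifulSoup(html) means the parser is always named, so the same code should behave the same way on any machine that has the same parser installed.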