I am trying to use BeautifulSoup to extract the contents from a website (http://brooklynexposed.com/events/). As an example of the problem I can run the following code:
import urllib
from bs4 import BeautifulSoup

url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()  # Python 2; in Python 3 use urllib.request.urlopen
soup = BeautifulSoup(html)  # no parser specified, so bs4 picks whichever parser it finds installed
print soup.prettify().encode('utf-8')
The output seems to cut off the html as follows:
<li class="event">
9:00pm - 11:00pm
<br/>
<a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
Comedy Sh
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</body>
</html>
It is cutting off the "Comedy Show" listing along with all of the HTML that follows it, right up to the final closing tags; the majority of the page is silently dropped. I have noticed the same thing on numerous websites: if the page is too long, BeautifulSoup fails to parse the entire page and simply truncates the output. Does anyone have a solution for this? If BeautifulSoup is not capable of handling such pages, does anyone know of another library with a function similar to prettify()?
Beautiful Soup provides find() and find_all() functions to pull specific data out of the HTML by passing a tag name: find() returns the first element matching the given tag, while find_all() returns a list of all matching elements.
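A minimal sketch of the two functions, using a made-up HTML snippet (not the actual brooklynexposed.com markup) so it runs standalone:

```python
from bs4 import BeautifulSoup

# Hypothetical event markup, loosely modeled on the page in the question
html = '<ul><li class="event">Show A</li><li class="event">Show B</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching element
first = soup.find('li')
print(first.get_text())  # Show A

# find_all() returns every matching element; keyword filters like
# class_ narrow the match further
for li in soup.find_all('li', class_='event'):
    print(li.get_text())
```

Note that class_ has a trailing underscore because class is a reserved word in Python.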
BeautifulSoup is a Python package that can parse broken HTML, just as lxml can via the libxml2 parser it is built on.
I had trouble with bs4 cutting off HTML on some machines but not others; it was not reproducible.
I switched to this:
soup = bs4.BeautifulSoup(html, 'html5lib')
.. and it works now.
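To illustrate why the parser choice matters, here is a sketch with a deliberately malformed snippet (an invented example, not the page from the question; html5lib must be installed separately, e.g. via pip install html5lib):

```python
from bs4 import BeautifulSoup

# Deliberately malformed HTML: neither <li> nor <a> is ever closed
broken = '<ul><li class="event">9:00pm<br><a href="/e/1">Comedy Show'

# html5lib rebuilds the tree the way a browser would, so the
# dangling tags are closed instead of the tail being dropped
soup = BeautifulSoup(broken, 'html5lib')
print(soup.find('a').get_text())  # Comedy Show
```

Passing the parser name explicitly also removes the machine-to-machine variation: without it, bs4 silently uses whichever parser happens to be installed, which is likely why the truncation was not reproducible.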