
BeautifulSoup hangs when using find

I have a problem with the bs4 package.

I have an HTML document like this one:

data = """<html><head></head><body>
<p> this is tab </p>
<img src="image.jpg">
</body></html>
"""

This is my code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html5lib')
soup.find_all("a")

When I run it, bs4 appears to loop forever and never returns anything. Maybe that is because the a tag doesn't exist in some of the HTML data.

Many thanks.
1. Yes, the example above works correctly.
2. But in my case, data is a multi-line HTML string read from a file:

from bs4 import BeautifulSoup
data = open("file.htm").read()
soup = BeautifulSoup(data, 'html5lib')
soup.find_all("a")

3. Please test with my file: file.htm
4. I'm using beautifulsoup4==4.4.1 with Python 3.5.1.
5. Thanks again.

Nguyễn Diễn asked Oct 25 '25 20:10


2 Answers

Try the built-in html.parser; it works even with invalid HTML.

from bs4 import BeautifulSoup

data = """<html><head></head><body>
<p> this is tab </p>
<img src="image.jpg">
</body></html>
"""

soup = BeautifulSoup(data, 'html.parser')
soup.find_all("a")
Dušan Maďar answered Oct 28 '25 08:10


I don't see why your program would hang when calling find_all. It might take a while if the HTML page is large, but it shouldn't hang.

Here are a few things you can try:

  • If you download the web page before parsing it, the download itself might be what hangs. Use pdb to find exactly where the program hangs: add the line import pdb; pdb.set_trace() at the start of your code and step through from there.

  • Make sure html5lib is installed by running pip freeze | grep html5lib. If it isn't listed, install it with pip install html5lib.

  • In a similar SO question, someone mentioned that upgrading BeautifulSoup fixed it. Try that with: pip install --upgrade beautifulsoup4
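The checks in the list above can also be done from inside Python. The following is a minimal sketch, not part of the original advice: it assumes Python 3.8 or newer (for importlib.metadata, so it will not run on the asker's 3.5.1), and the helper names installed_version and arm_hang_watchdog are made up for illustration. It swaps pdb for the standard faulthandler module, which prints a traceback automatically if the script is still running after a timeout.

```python
import faulthandler
from importlib import metadata  # standard library from Python 3.8 on


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def arm_hang_watchdog(seconds=30):
    """If the process is still running after `seconds`, dump all tracebacks and exit."""
    faulthandler.dump_traceback_later(seconds, exit=True)


if __name__ == "__main__":
    # Rough equivalent of `pip freeze | grep ...` for the packages discussed here.
    for name in ("beautifulsoup4", "html5lib", "lxml"):
        print(name, installed_version(name) or "not installed")

    # To locate a hang, arm the watchdog before the suspect code, for example:
    #   arm_hang_watchdog(30)
    #   ... download, BeautifulSoup(...) and find_all(...) calls go here ...
    #   faulthandler.cancel_dump_traceback_later()
```

If the traceback printed by the watchdog ends inside the download code rather than inside bs4, the parser is probably not the culprit.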

The BeautifulSoup documentation recommends specific parsers for certain Python versions:

If you can, I recommend you install and use lxml for speed.
If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib – Python’s built-in HTML parser is just not very good in older versions.
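One way to act on that recommendation is to pick the parser at runtime depending on what is installed. This is only a sketch under assumptions: pick_parser and make_soup are hypothetical helper names, and the preference order (lxml, then html5lib, then the built-in html.parser) is one reading of the quoted advice, not something bs4 does for you in this form.

```python
from importlib.util import find_spec

# Parser names bs4 accepts, in the assumed order of preference.
PREFERRED = ("lxml", "html5lib")


def pick_parser(preferred=PREFERRED):
    """Return the first preferred parser whose module is importable,
    falling back to Python's built-in html.parser."""
    for name in preferred:
        if find_spec(name) is not None:
            return name
    return "html.parser"


def make_soup(markup):
    """Parse markup with whichever parser pick_parser() selects."""
    # Imported here so pick_parser() stays usable without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, pick_parser())
```

With this in place, make_soup(data).find_all("a") would replace the hard-coded 'html5lib' argument from the question. Note that different parsers can build slightly different trees from invalid HTML.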

Forge answered Oct 28 '25 08:10