I am migrating some parsers from BeautifulSoup 3 to BeautifulSoup 4, and I thought it would be a good idea to profile how much faster they would get, considering that lxml is very fast and it is the parser I am using with BS4. Here are the profile results:
For BS3:
43208 function calls (42654 primitive calls) in 0.103 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:2(<module>)
18 0.000 0.000 0.000 0.000 <string>:8(__new__)
1 0.000 0.000 0.072 0.072 <string>:9(parser)
32 0.000 0.000 0.000 0.000 BeautifulSoup.py:1012(__init__)
1 0.000 0.000 0.000 0.000 BeautifulSoup.py:1018(buildTagMap)
...
For BS4 using lxml:
164440 function calls (163947 primitive calls) in 0.244 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.040 0.040 0.069 0.069 <string>:2(<module>)
18 0.000 0.000 0.000 0.000 <string>:8(__new__)
1 0.000 0.000 0.158 0.158 <string>:9(parser)
1 0.000 0.000 0.008 0.008 HTMLParser.py:1(<module>)
1 0.000 0.000 0.000 0.000 HTMLParser.py:54(HTMLParseError)
...
Why is BS4 making about four times as many function calls? And why is it using HTMLParser at all if I set it to use lxml?
The most noticeable changes I made going from BS3 to BS4 were these:
BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES) --->
BeautifulSoup(html, 'lxml')
[x.getText('**SEP**') for x in i.findChildren('font')[:2]] --->
[x.getText('**SEP**', strip=True) for x in i.findChildren('font')[:2]]
Everything else is just name changes (like findParent --> find_parent).
EDIT:
My environment:
python 2.7.3
beautifulsoup4==4.1.0
lxml==2.3.4
EDIT 2:
Here is a small code sample to try it out:
from cProfile import Profile
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup as BS4
import urllib2


def parse(html):
    # BS4 with the lxml parser
    soup = BS4(html, 'lxml')
    hl = soup.find_all('span', {'class': 'mw-headline'})
    return [x.get_text(strip=True) for x in hl]


def parse3(html):
    # BS3 with entity conversion
    soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
    hl = soup.findAll('span', {'class': 'mw-headline'})
    return [x.getText() for x in hl]


if __name__ == "__main__":
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    html = ''.join(opener.open('http://en.wikipedia.org/wiki/Price').readlines())

    # Profile the BS4 version
    profiler = Profile()
    print profiler.runcall(parse, html)
    profiler.print_stats()

    # Profile the BS3 version on the same document
    profiler2 = Profile()
    print profiler2.runcall(parse3, html)
    profiler2.print_stats()
I believe the main problem is a bug in Beautiful Soup 4. I've filed it and a fix will be released in the next version. Thanks for finding this.
That said, I have no idea why your profile mentions the HTMLParser class at all, given that you're using lxml.
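One way to look into both questions might be to sort the profile by cumulative time instead of by name, and to ask pstats which functions called anything matching "HTMLParser". The following is only a minimal sketch using the standard library: parse_stub is a hypothetical stand-in for the real parse() function, and the code is written against Python 3 (under Python 2.7, StringIO.StringIO would take the place of io.StringIO).

```python
import cProfile
import io
import pstats


def parse_stub(text):
    # Hypothetical stand-in for the real parse() function from the question.
    return [word.strip() for word in text.split(",")]


def profile_report(func, *args):
    """Profile func(*args) and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Show the ten most expensive entries by cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    # List the callers of every profiled function whose name matches the pattern.
    stats.print_callers("HTMLParser")
    return result, buf.getvalue()


result, report = profile_report(parse_stub, "a, b, c")
print(report)
```

With the real parse() in place of the stub, the callers section should indicate whether the HTMLParser entries come from the parsing itself or from somewhere else, such as module imports that happen inside the profiled run.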