I am trying to understand why running multiple parsers in parallel threads does not speed up parsing HTML. One thread does 100 tasks twice as fast as two threads with 50 tasks each.
Here is my code:
from lxml.html import fromstring
import time
from threading import Thread

try:
    from urllib import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

DATA = urlopen('http://lxml.de/FAQ.html').read()

def func(number):
    for x in range(number):
        fromstring(DATA)
print('Testing one thread (100 jobs per thread)')
start = time.time()
t1 = Thread(target=func, args=[100])
t1.start()
t1.join()
elapsed = time.time() - start
print('Time: %.5f' % elapsed)
print('Testing two threads (50 jobs per thread)')
start = time.time()
t1 = Thread(target=func, args=[50])
t2 = Thread(target=func, args=[50])
t1.start()
t2.start()
t1.join()
t2.join()
elapsed = time.time() - start
print('Time: %.5f' % elapsed)
Output on my 4-core CPU machine:
Testing one thread (100 jobs per thread)
Time: 0.55351
Testing two threads (50 jobs per thread)
Time: 0.88461
According to the FAQ (http://lxml.de/FAQ.html#can-i-use-threads-to-concurrently-access-the-lxml-api), two threads should work faster than one thread:
Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself.
...
The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
So, the question is why two threads are slower than one thread?
My environment: linux debian, lxml 3.3.5-1+b1, same results on python2 and python3
BTW, a friend of mine ran this test on macOS and got the same timing for one and for two threads. Either way, that is not how it is supposed to work according to the documentation (two threads should be roughly twice as fast).
UPD: Thanks to spectras, who pointed out that a parser needs to be created in each thread. The updated func is:
from io import BytesIO  # DATA is bytes, so BytesIO rather than StringIO
from lxml.html import HTMLParser
from lxml.etree import parse

def func(number):
    parser = HTMLParser()  # a fresh parser, local to this thread
    for x in range(number):
        parse(BytesIO(DATA), parser=parser)
The output is:
Testing one thread (100 jobs per thread)
Time: 0.53993
Testing two threads (50 jobs per thread)
Time: 0.28869
That is exactly what I wanted! :)
The documentation gives a good lead there: "as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself."

You're definitely not creating a parser for each thread. And you can see that, if you do not specify a parser yourself, the fromstring function uses a global one.

Now for the other condition: at the bottom of lxml/html/__init__.py you can see that html_parser is an instance of a subclass of lxml.etree.HTMLParser, with no special behavior and, most importantly, no thread-local storage. I cannot test here, but I believe you end up sharing that single parser across your two threads, which does not qualify as a "default parser" replicated for each thread.

Could you try instantiating the parsers yourself and feeding them to fromstring? Or I'll do it in an hour or so and update this post.
from lxml.html import fromstring, HTMLParser

def func(number):
    parser = HTMLParser()  # one parser per thread
    for x in range(number):
        fromstring(DATA, parser=parser)
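As an aside, the "replicated for each thread" behaviour the docs describe can be imitated in pure Python with threading.local. A minimal sketch, where the Parser class is a hypothetical stand-in for lxml's HTMLParser (not lxml's actual implementation):

```python
import threading

class Parser:
    """Hypothetical stand-in for an lxml HTMLParser instance."""
    pass

_local = threading.local()

def get_parser():
    # Lazily create one Parser per thread and cache it in thread-local
    # storage -- this is what "a parser for each thread" amounts to.
    if not hasattr(_local, "parser"):
        _local.parser = Parser()
    return _local.parser

parsers = []

def worker():
    # Each worker thread records "its" parser; repeated calls within
    # the same thread return the same cached instance.
    parsers.append(get_parser())

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(p) for p in parsers}))  # 2: each thread got its own parser
```

Creating the parser inside func, as above, achieves the same isolation without the extra machinery, which is why the simple fix works.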