Why does multithreading not speed up parsing HTML with lxml?

I am trying to understand why running multiple parsers in parallel threads does not speed up HTML parsing. In my test, one thread finishes 100 tasks about 1.6 times as fast as two threads with 50 tasks each (see the timings below).

Here is my code:

from lxml.html import fromstring
import time
from threading import Thread
try:
    from urllib import urlopen
except ImportError:
    from urllib.request import urlopen

DATA = urlopen('http://lxml.de/FAQ.html').read()


def func(number):
    for x in range(number):
        fromstring(DATA)


print('Testing one thread (100 job per thread)')
start = time.time()
t1 = Thread(target=func, args=[100])
t1.start()
t1.join()
elapsed = time.time() - start
print('Time: %.5f' % elapsed)

print('Testing two threads (50 jobs per thread)')
start = time.time()
t1 = Thread(target=func, args=[50])
t2 = Thread(target=func, args=[50])
t1.start()
t2.start()
t1.join()
t2.join()
elapsed = time.time() - start
print('Time: %.5f' % elapsed)

Output on my 4-core machine:

Testing one thread (100 job per thread)
Time: 0.55351
Testing two threads (50 jobs per thread)
Time: 0.88461

According to the FAQ (http://lxml.de/FAQ.html#can-i-use-threads-to-concurrently-access-the-lxml-api), two threads should be faster than one thread:

Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself.

...

The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.

So the question is: why are two threads slower than one thread?

My environment: Debian Linux, lxml 3.3.5-1+b1; same results on Python 2 and Python 3.

BTW, a friend ran this test on macOS and got the same timings for one and for two threads. Either way, that is not what the documentation suggests should happen (two threads should be twice as fast).

UPD: Thanks to spectras, who pointed out that a parser needs to be created in each thread. The updated func function is:

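from io import BytesIO

from lxml.html import HTMLParser
from lxml.etree import parse

def func(number):
    # func runs inside each thread, so every thread gets its own parser.
    parser = HTMLParser()
    for x in range(number):
        # DATA is bytes (from urlopen().read()), so wrap it in BytesIO.
        parse(BytesIO(DATA), parser=parser)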

The output is:

Testing one thread (100 jobs per thread)
Time: 0.53993
Testing two threads (50 jobs per thread)
Time: 0.28869

That is exactly what I wanted! :)
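
To check how the per-thread-parser version scales beyond two threads, the timing code can be wrapped in a small helper. This is only a sketch for Python 3: `time_threads` and its parameters are names made up for illustration (they are not part of lxml), and the `busy` workload is a stand-in that should be replaced with the lxml-based `func` to measure parsing.

```python
import time
from threading import Thread


def time_threads(worker, n_threads, total_jobs):
    """Split total_jobs across n_threads and return the elapsed seconds.

    worker is any callable that takes a job count, such as func above.
    """
    # Spread the remainder over the first threads so no job is dropped.
    base, extra = divmod(total_jobs, n_threads)
    counts = [base + (1 if i < extra else 0) for i in range(n_threads)]

    threads = [Thread(target=worker, args=(count,)) for count in counts]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    def busy(number):
        # Placeholder workload (assumption); swap in the parsing func.
        for _ in range(number):
            sum(range(1000))

    for n in (1, 2, 4):
        print('%d thread(s): %.5f s' % (n, time_threads(busy, n, 100)))
```

Note that the pure-Python placeholder holds the GIL, so it will not show a speedup by itself; only work that releases the GIL, such as lxml parsing with one parser per thread, should.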

asked Aug 29 '15 11:08

Stack Exchange User

1 Answer

The documentation gives a good lead there: "as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself."

You're definitely not creating a parser for each thread. You can see that, if you do not specify a parser yourself, the fromstring function uses a global one.

Now for the other condition: you can see at the bottom of the file that html_parser is an instance of a subclass of lxml.etree.HTMLParser, with no special behavior and, most importantly, no thread-local storage. I cannot test here, but I believe you end up sharing a parser across your two threads, which does not qualify as the "default parser".

Could you try instantiating the parsers yourself and feeding them to fromstring? Otherwise I'll do it in an hour or so and update this post.

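from lxml.html import HTMLParser, fromstring

def func(number):
    # Create the parser inside the thread's target so it is not shared.
    parser = HTMLParser()
    for x in range(number):
        fromstring(DATA, parser=parser)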
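
If the threads come from a pool rather than being started by hand, the same "one parser per thread" idea can be expressed with thread-local storage. A minimal sketch, assuming a hypothetical helper `per_thread` (my own name, not an lxml API) that lazily builds one object per thread:

```python
import threading


def per_thread(factory):
    """Return a getter that builds one object per thread via factory()."""
    local = threading.local()

    def get():
        # Attributes on `local` are private to each thread, so the first
        # call in a thread creates the object and later calls reuse it.
        try:
            return local.instance
        except AttributeError:
            local.instance = factory()
            return local.instance

    return get
```

With lxml this might be used as `get_parser = per_thread(HTMLParser)` at module level and `fromstring(DATA, parser=get_parser())` inside each worker, so that no parser instance is ever shared between threads.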

answered Nov 15 '22 20:11

spectras

Tags: performance, python, multithreading, lxml, gil
