Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there an up-to-date fast YAML parser with python bindings?

What's the latest and greatest for fast YAML parsing in Python? Syck is out of date and recommends using PyYaml, yet PyYaml is pretty slow, and suffers from the GIL problem:

>>> def xit(f, x):
        import threading
        for i in xrange(x):
                threading.Thread(target=f).start()

>>> def stressit():
        start = time.time()
        res = yaml.load(open(path_to_11000_byte_yaml_file))
        print "Took %.2fs" % (time.time() - start,)    

>>> xit(stressit, 1)
Took 0.37s
>>> xit(stressit, 2)
Took 1.40s
Took 1.41s
>>> xit(stressit, 4)
Took 2.98s
Took 2.98s
Took 2.99s
Took 3.00s

Given my use case I can cache the parsed objects, but I'd still prefer a faster solution even for that.

like image 574
Claudiu Avatar asked May 31 '13 19:05

Claudiu


1 Answers

The linked wiki page states after the warning "Use libyaml (c), and PyYaml (python)". Although the note does have a bad wikilink (should be PyYAML not PyYaml).

As for performance, depending on how you installed PyYAML you should have the CParser class available which implements a YAML parser written in optimized C. While I don't think this gets around the GIL issue, it is markedly faster. Here are a few cursory benchmarks I ran on my machine (AMD Athlon II X4 640, 3.0GHz, 8GB RAM):

First with the default pure-Python parser:

$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
    'yaml.load(y)'                    
10 loops, best of 3: 405 msec per loop

With the CParser:

$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
    'yaml.load(y, Loader=yaml.CLoader)'
10 loops, best of 3: 59.2 msec per loop

And, for comparison, with PyPy using the pure-Python parser.

$ pypy -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
    'yaml.load(y)'
10 loops, best of 3: 101 msec per loop

For large.yaml I just googled for "large yaml file" and came across this:

https://gist.github.com/nrh/667383/raw/1b3ba75c939f2886f63291528df89418621548fd/large.yaml

(I had to remove the first couple of lines to make it a single-doc YAML file otherwise yaml.load complains.)

EDIT:

Another thing to consider is using the multiprocessing module instead of threads. This gets around GIL problems, but does require a bit more boiler-plate code to communicate between the processes. There are a number of good libraries available though to make multiprocessing easier. There's a pretty good list of them here.

like image 199
Isaac Freeman Avatar answered Oct 11 '22 23:10

Isaac Freeman