What's the latest and greatest for fast YAML parsing in Python? Syck is out of date and recommends using PyYaml, yet PyYaml is pretty slow, and suffers from the GIL problem:
>>> import time, yaml
>>> def xit(f, x):
...     import threading
...     for i in xrange(x):
...         threading.Thread(target=f).start()
...
>>> def stressit():
...     start = time.time()
...     res = yaml.load(open(path_to_11000_byte_yaml_file))
...     print "Took %.2fs" % (time.time() - start,)
...
>>> xit(stressit, 1)
Took 0.37s
>>> xit(stressit, 2)
Took 1.40s
Took 1.41s
>>> xit(stressit, 4)
Took 2.98s
Took 2.98s
Took 2.99s
Took 3.00s
Given my use case I can cache the parsed objects, but I'd still prefer a faster solution even for that.
After the warning, the linked wiki page states "Use libyaml (c), and PyYaml (python)", although the note has a bad wikilink (it should be PyYAML, not PyYaml).
As for performance, depending on how you installed PyYAML you should have the CParser class available which implements a YAML parser written in optimized C. While I don't think this gets around the GIL issue, it is markedly faster. Here are a few cursory benchmarks I ran on my machine (AMD Athlon II X4 640, 3.0GHz, 8GB RAM):
First with the default pure-Python parser:
$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y)'
10 loops, best of 3: 405 msec per loop
With the CParser:
$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y, Loader=yaml.CLoader)'
10 loops, best of 3: 59.2 msec per loop
And, for comparison, with PyPy using the pure-Python parser:
$ pypy -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y)'
10 loops, best of 3: 101 msec per loop
For large.yaml I just googled for "large yaml file" and came across this:
https://gist.github.com/nrh/667383/raw/1b3ba75c939f2886f63291528df89418621548fd/large.yaml
(I had to remove the first couple of lines to make it a single-doc YAML file otherwise yaml.load complains.)
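Note that the CLoader class is only available when PyYAML was built against libyaml, so code meant to run on either kind of install usually tries the C loader first and falls back to the pure-Python one. A minimal sketch of that pattern:
import yaml

# Use the libyaml-backed C loader if available, otherwise fall back
# to the pure-Python loader.
try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader

with open("large.yaml") as f:
    data = yaml.load(f, Loader=Loader)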
EDIT:
Another thing to consider is using the multiprocessing module instead of threads. This gets around the GIL problem, but does require a bit more boilerplate code to communicate between the processes. There are a number of good libraries available to make multiprocessing easier, though. There's a pretty good list of them here.
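As a rough sketch of that approach (the file list and pool size below are just placeholder values), parsing several YAML files in parallel worker processes could look like this:
import multiprocessing
import yaml

def parse_file(path):
    # Each worker process parses one file; processes do not share the GIL,
    # so the parsing runs in parallel across CPU cores.
    with open(path) as f:
        return yaml.load(f)

if __name__ == "__main__":
    # Hypothetical list of YAML files to parse.
    paths = ["a.yaml", "b.yaml", "c.yaml"]
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(parse_file, paths)
    pool.close()
    pool.join()
The trade-off is that each parsed result has to be pickled and sent back to the parent process, so this pays off mainly when the documents are large enough that parsing dominates that overhead.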