Can I speed up YAML?

I made a little test case to compare YAML and JSON speed:

    import json
    import yaml
    from datetime import datetime
    from random import randint

    NB_ROW=1024

    print 'Does yaml is using libyaml ? ',yaml.__with_libyaml__ and 'yes' or 'no'

    dummy_data = [ { 'dummy_key_A_%s' % i: i, 'dummy_key_B_%s' % i: i } for i in xrange(NB_ROW) ]


    with open('perf_json_yaml.yaml','w') as fh:
        t1 = datetime.now()
        yaml.safe_dump(dummy_data, fh, encoding='utf-8', default_flow_style=False)
        t2 = datetime.now()
        dty = (t2 - t1).total_seconds()
        print 'Dumping %s row into a yaml file : %s' % (NB_ROW,dty)

    with open('perf_json_yaml.json','w') as fh:
        t1 = datetime.now()
        json.dump(dummy_data,fh)
        t2 = datetime.now()
        dtj = (t2 - t1).total_seconds()
        print 'Dumping %s row into a json file : %s' % (NB_ROW,dtj)

    print "json is %dx faster for dumping" % (dty/dtj)

    with open('perf_json_yaml.yaml') as fh:
        t1 = datetime.now()
        data = yaml.safe_load(fh)
        t2 = datetime.now()
        dty = (t2 - t1).total_seconds()
        print 'Loading %s row from a yaml file : %s' % (NB_ROW,dty)

    with open('perf_json_yaml.json') as fh:
        t1 = datetime.now()
        data = json.load(fh)
        t2 = datetime.now()
        dtj = (t2 - t1).total_seconds()
        print 'Loading %s row into from json file : %s' % (NB_ROW,dtj)

    print "json is %dx faster for loading" % (dty/dtj)

And the result is:

    Does yaml is using libyaml ?  yes
    Dumping 1024 row into a yaml file : 0.251139
    Dumping 1024 row into a json file : 0.007725
    json is 32x faster for dumping
    Loading 1024 row from a yaml file : 0.401224
    Loading 1024 row into from json file : 0.001793
    json is 223x faster for loading

I am using PyYAML 3.11 with the libyaml C library on Ubuntu 12.04. I know that JSON is much simpler than YAML, but with a 223x ratio between JSON and YAML I am wondering whether my configuration is correct.

Do you see the same speed ratio?
How can I speed up yaml.load()?

Eric asked Jan 02 '15 14:01


2 Answers

You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

What's happening is that Python's json library encodes Python's built-in datatypes directly into text chunks, replacing ' with " and deleting a , here and there (to oversimplify a bit).

PyYAML, on the other hand, has to construct a whole representation graph before serialising it into a string.

The same work has to happen in reverse when loading.

The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would bring a huge leap in performance, unless you're willing to write your own single-purpose, sort-of-YAML parser, taking the following comment into consideration:

YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know that no object is repeated and only basic types appear, you can use a JSON serialiser; the output will still be valid YAML.
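
As a rough illustration of that comment, the sketch below writes plain data with the json module and reads the same text back with both json and PyYAML. The variable names (rows, text, from_json, from_yaml) are made up for this example, and it assumes only simple types such as strings and integers; some JSON edge cases may be resolved differently by a YAML 1.1 parser like PyYAML, so treat it as a sketch rather than a general guarantee.

```python
import json

import yaml  # PyYAML

# Only basic types, and no object is referenced more than once.
rows = [{"dummy_key_A_%s" % i: i, "dummy_key_B_%s" % i: i} for i in range(4)]

# Serialise with the json module.
text = json.dumps(rows)

# Read it back with the JSON parser (the fast path)...
from_json = json.loads(text)

# ...or with a YAML loader: the JSON text is also a flow-style YAML document.
from_yaml = yaml.safe_load(text)

print(from_json == rows, from_yaml == rows)
```

If the file only ever needs to be read by your own code, loading it with json.load keeps the speed advantage, while tools that expect YAML can still parse it.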

-- UPDATE

What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml uses the pure-Python parser. You have to tell it explicitly that you want to use PyYAML's C parser.

You can do it this way:

    import yaml
    from yaml import CLoader as Loader, CDumper as Dumper

    dump = yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
    data = yaml.load(fh, Loader=Loader)

For this to work, PyYAML has to be built against the LibYAML C library. On Debian/Ubuntu that means having the libyaml-dev package installed before building PyYAML (the yaml-cpp-dev / libyaml-cpp-dev packages belong to yaml-cpp, a separate C++ library that PyYAML does not use), for instance with apt-get:

    $ apt-get install libyaml-dev

Based on your output, your PyYAML is already built with LibYAML.

I can't test it right now because I'm running OS X and brew has some trouble installing the library, but the PyYAML documentation is pretty clear that performance will be much better.
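
Since the question uses safe_dump and safe_load, which go through the pure-Python SafeDumper and SafeLoader even when LibYAML is available, a closer equivalent is probably the C-backed safe classes. The sketch below assumes a Python 3 environment with PyYAML installed; FastLoader, FastDumper and rows are illustrative names, and the fallback branch only covers builds without LibYAML.

```python
import io

import yaml

# Prefer the LibYAML-backed safe classes; fall back to the pure-Python ones
# when PyYAML was built without LibYAML.
try:
    from yaml import CSafeLoader as FastLoader, CSafeDumper as FastDumper
except ImportError:
    from yaml import SafeLoader as FastLoader, SafeDumper as FastDumper

rows = [{"dummy_key_A_%s" % i: i, "dummy_key_B_%s" % i: i} for i in range(4)]

# Dump to an in-memory text buffer instead of a file to keep the sketch self-contained.
buf = io.StringIO()
yaml.dump(rows, buf, Dumper=FastDumper, default_flow_style=False)

# Load the same text back with the matching loader class.
data = yaml.load(buf.getvalue(), Loader=FastLoader)

print(data == rows, FastLoader.__name__)
```

Printing the class name shows which implementation was actually picked up, which is a quick way to check whether the C bindings are in use.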

Jivan answered Oct 10 '22 13:10


For reference, I compared a couple of human-readable formats, and Python's yaml reader is indeed by far the slowest. (Note the log scaling in the plot below.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:

[Plot: runtime of each loader against input size n, log-scaled]


Code to reproduce the plot:

    import numpy
    import perfplot

    import json
    import ujson
    import orjson
    import toml
    import yaml
    from yaml import Loader, CLoader
    import pandas


    def setup(n):
        numpy.random.seed(0)
        data = numpy.random.rand(n, 3)

        with open("out.yml", "w") as f:
            yaml.dump(data.tolist(), f)

        with open("out.json", "w") as f:
            json.dump(data.tolist(), f, indent=4)

        with open("out.dat", "w") as f:
            numpy.savetxt(f, data)

        with open("out.toml", "w") as f:
            toml.dump({"data": data.tolist()}, f)


    def yaml_python(arr):
        with open("out.yml", "r") as f:
            out = yaml.load(f, Loader=Loader)
        return out


    def yaml_c(arr):
        with open("out.yml", "r") as f:
            out = yaml.load(f, Loader=CLoader)
        return out


    def json_load(arr):
        with open("out.json", "r") as f:
            out = json.load(f)
        return out


    def ujson_load(arr):
        with open("out.json", "r") as f:
            out = ujson.load(f)
        return out


    def orjson_load(arr):
        with open("out.json", "rb") as f:
            out = orjson.loads(f.read())
        return out


    def loadtxt(arr):
        with open("out.dat", "r") as f:
            out = numpy.loadtxt(f)
        return out


    def pandas_read(arr):
        out = pandas.read_csv("out.dat", header=None, sep=" ")
        return out.values


    def toml_load(arr):
        with open("out.toml", "r") as f:
            out = toml.load(f)
        return out["data"]


    perfplot.save(
        "out.png",
        setup=setup,
        kernels=[
            yaml_python,
            yaml_c,
            json_load,
            loadtxt,
            pandas_read,
            toml_load,
            ujson_load,
            orjson_load,
        ],
        n_range=[2 ** k for k in range(18)],
    )
Nico Schlömer answered Oct 10 '22 13:10