How is it that json serialization is so much faster than yaml serialization in Python?

Tags:

I have code that relies heavily on yaml for cross-language serialization and while working on speeding some stuff up I noticed that yaml was insanely slow compared to other serialization methods (e.g., pickle, json).

So what really blows my mind is that json is so much faster that yaml when the output is nearly identical.

>>> import yaml, cjson; d={'foo': {'bar': 1}} >>> yaml.dump(d, Dumper=yaml.SafeDumper) 'foo: {bar: 1}\n' >>> cjson.encode(d) '{"foo": {"bar": 1}}' >>> import yaml, cjson; >>> timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000) 44.506911039352417 >>> timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000) 16.852826118469238 >>> timeit("cjson.encode(d)", setup="import cjson; d={'foo': {'bar': 1}}", number=10000) 0.073784112930297852

PyYaml's CSafeDumper and cjson are both written in C so it's not like this is a C vs Python speed issue. I've even added some random data to it to see if cjson is doing any caching, but it's still way faster than PyYaml. I realize that yaml is a superset of json, but how could the yaml serializer be 2 orders of magnitude slower with such simple input?

643

asked Mar 16 '10 02:03

guidoism

1 Answers

JSON’s foremost design goal is simplicity and universality. Thus, JSON is trivial to generate and parse, at the cost of reduced human readability. It also uses a lowest common denominator information model, ensuring any JSON data can be easily processed by every modern programming environment.

In contrast, YAML’s foremost design goals are human readability and support for serializing arbitrary native data structures. Thus, YAML allows for extremely readable files, but is more complex to generate and parse. In addition, YAML ventures beyond the lowest common denominator data types, requiring more complex processing when crossing between different programming environments.

I'm not a YAML parser implementor, so I can't speak specifically to the orders of magnitude without some profiling data and a big corpus of examples. In any case, be sure to test over a large body of inputs before feeling confident in benchmark numbers.

Update Whoops, misread the question. :-( Serialization can still be blazingly fast despite the large input grammar; however, browsing the source, it looks like PyYAML's Python-level serialization constructs a representation graph whereas simplejson encodes builtin Python datatypes directly into text chunks.

187

answered Sep 29 '22 12:09

cdleary

Related questions
                            
                                Detect File Change Without Polling [duplicate]
                            
                                How to deal with Pylint's "too-many-instance-attributes" message?
                            
                                How to get files in a directory, including all subdirectories
                            
                                How to simply add a column level to a pandas dataframe
                            
                                Pandas get frequency of item occurrences in a column as percentage [duplicate]
                            
                                How to implement a Boolean search with multiple columns in pandas
                            
                                Python SimpleHTTPServer with PHP
                            
                                Pandas apply but only for rows where a condition is met
                            
                                No module named 'pymysql'
                            
                                How to set attributes using property decorators?
                            
                                Why doesn't Python give any error when quotes around a string do not match?
                            
                                Failed to activate virtualenv with pyenv
                            
                                Passing table name as a parameter in psycopg2
                            
                                Alembic: IntegrityError: "column contains null values" when adding non-nullable column
                            
                                Serving dynamically generated ZIP archives in Django
                            
                                How to extract an arbitrary line of values from a numpy array?
                            
                                How do I use pytest with virtualenv?
                            
                                django.core.exceptions.ImproperlyConfigured: Error loading psycopg module: No module named psycopg
                            
                                ImportError: No module named virtualenv
                            
                                Python / Pandas - GUI for viewing a DataFrame or Matrix [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How is it that json serialization is so much faster than yaml serialization in Python?

Tags:

python

json

serialization

yaml

guidoism

People also ask

1 Answers

cdleary

Recent Activity

Donate For Us