
Loading special characters with PyYaml

I'm working on loading a list of emoji characters in a simple Python 3.6 script. The YAML structure is essentially as follows:

- 🙂
- 😁
- 😬

My python script looks like this:

import yaml
f = open('emojis.yml')
EMOJIS = yaml.load(f)
f.close()

I'm getting the following exception:

yaml.reader.ReaderError: unacceptable character #x001d: special characters are not allowed in "emojis.yml", position 2

I have seen the allow_unicode=True option, but that seems to be available only for yaml.dump. It appears that people have had trouble with similar issues in Python 2, but since all strings in Python 3 are Unicode, I can't figure out why this isn't working.

I've also tried wrapping my emojis in quotes and using a custom constructor for 'tag:yaml.org,2002:str'. My custom constructor is never even hit, presumably because the YAML library fails to recognize my emoji as a string. I also observe the same behavior when I define my emoji directly as a string in source.

Is there a way to load a yaml file containing emojis with PyYAML?

Quinn Stearns asked Jul 02 '17 21:07


2 Answers

You should upgrade to ruamel.yaml (disclaimer: I am the author of that package), which has fixed this and many other long-standing PyYAML issues:

import sys
from ruamel.yaml import YAML

yaml = YAML()

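# Print every code point in the file as hex, four per row,
# to confirm what the file actually contains.
with open('emojis.yml') as fp:
    idx = 0
    for c in fp.read():
        print('{:08x}'.format(ord(c)), end=' ')
        idx += 1
        if idx % 4 == 0:
            print()

# Load the document and write it back out.
with open('emojis.yml') as fp:
    data = yaml.load(fp)
yaml.dump(data, sys.stdout)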

gives:

0000002d 00000020 0001f642 0000000a 
0000002d 00000020 0001f601 0000000a 
0000002d 00000020 0001f62c 0000000a 
['🙂', '😁', '😬']

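If you really have to stick with PyYAML, you can override the reader's pattern for non-printable characters so that it also accepts the supplementary planes (U+10000 through U+10FFFF):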

import yaml.reader
import re

yaml.reader.Reader.NON_PRINTABLE = re.compile(
    u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')

to get rid of the error.
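To see why the override helps, the two character classes can be compared using only the standard re module. This is a rough sketch: BMP_ONLY is my assumption of what the pattern looks like in the affected PyYAML releases (the same class without the supplementary-plane range), and both names are made up for illustration.

```python
import re

# Assumed pre-fix pattern: nothing above U+FFFD is accepted.
BMP_ONLY = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

# Pattern from the workaround above: also accepts U+10000 to U+10FFFF.
WITH_SUPPLEMENTARY = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')

# First line of the emoji list from the question.
text = '- \U0001F642\n'

# A match means the reader would treat that character as unacceptable.
rejected_before = BMP_ONLY.search(text)
rejected_after = WITH_SUPPLEMENTARY.search(text)

print(rejected_before.start() if rejected_before else None)
print(rejected_after)
```

With the narrower class, the first match lands on the emoji itself, which lines up with the "position 2" in the question's error message. With the wider class nothing matches, so the reader has no reason to raise.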


Starting with version 0.15.16, ruamel.yaml now also dumps all supplementary plane Unicode without reverting to \Uxxxxxxxx (controllable in the new API via .unicode_supplementary, and depending on allow_unicode).

Anthon answered Sep 27 '22 18:09


Update

The latest version of PyYAML has fixed this bug. Upgrade to pyyaml>=5.


Original answer

This seems to be a bug in PyYAML. A workaround is to write the characters as YAML escape sequences inside double-quoted strings:

$ cat test.yaml
- "\U0001f642"
- "\U0001f601"
- "\U0001f62c"

$ python
...
>>> yaml.load(open('test.yaml'))
['🙂', '😁', '😬']
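If the list is long, the escaped file can be generated rather than typed by hand. The following is only a sketch: escape_supplementary is a hypothetical helper, not part of PyYAML. It assumes the strings contain no double quotes or backslashes, which would need their own escaping inside a double-quoted YAML scalar.

```python
def escape_supplementary(text):
    """Replace each character above U+FFFF with a \\UXXXXXXXX escape."""
    return ''.join(
        '\\U{:08x}'.format(ord(ch)) if ord(ch) > 0xFFFF else ch
        for ch in text
    )


emojis = ['\U0001F642', '\U0001F601', '\U0001F62C']

# The \U escape is only interpreted inside double-quoted YAML scalars.
lines = ['- "{}"'.format(escape_supplementary(e)) for e in emojis]
print('\n'.join(lines))
```

This prints the three lines of test.yaml shown above. Characters in the Basic Multilingual Plane pass through unchanged.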
Anthony Sottile answered Sep 27 '22 18:09