I am using Python to automatically generate QSF files for Qualtrics online surveys. The QSF format requires non-ASCII characters to be escaped using the \u + hex convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I am achieving this with the following expression:
'слово'.encode('ascii','backslashreplace').decode('ascii')
The output is exactly what I need, but since this is a two-step process, I wondered if there is a more efficient way to get the same result.
If you open your output file in binary mode ('wb'), it accepts bytes rather than str, so you can write the encoded result directly:
s = 'слово'
with open('data.txt','wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed
This seems to do what you want:
$ cat data.txt
\u0441\u043b\u043e\u0432\u043e
and it avoids both the decode step and any translation that happens when writing str to a text stream.
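As a quick sanity check, here is a minimal sketch that writes the escaped bytes and reads them back. The temporary path is hypothetical and only there so the snippet can run anywhere; it is not part of the original answer.

```python
import os
import tempfile

s = 'слово'

# Hypothetical temporary path, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

# Binary mode: write the already-encoded bytes, no decode step needed.
with open(path, 'wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed

with open(path, 'rb') as f:
    raw = f.read()

# Every byte on disk should be plain ASCII.
is_ascii = all(b < 128 for b in raw)

# Decoding with the same codec should recover the original text.
round_trip = raw.rstrip(b'\n').decode('unicode_escape')

print(raw, is_ascii, round_trip == s)
```

If the round trip holds for your real survey text as well, the binary-mode approach is probably safe to drop in.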
Updated to use encode('unicode_escape') as per the suggestion of @J.F.Sebastian.
%timeit reports that it takes roughly a third less time than encode('ascii', 'backslashreplace') for this string (1.55 µs versus 2.37 µs per loop):
In [18]: f = open('data.txt', 'wb')
In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop
In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop
In [21]: f.close()
Curiously, the spread between the slowest and fastest runs is much larger for encode('unicode_escape') (about 224x) than for encode('ascii', 'backslashreplace') (about 9x), even though its best per-loop time is lower, so be sure to test both in your environment.
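To compare the two outside IPython, a rough sketch with the standard timeit module is below. It times only the encoding call and leaves out f.write, on the assumption (not verified here) that file buffering may account for some of the outliers above. The loop count and repeat count are arbitrary choices.

```python
import timeit

s = 'слово'

# Time only the encoding step, without file writes.
# repeat() returns one total time (in seconds) per run; take the best.
n = 100_000
t_escape = min(timeit.repeat(
    lambda: s.encode('unicode_escape'), number=n, repeat=5))
t_replace = min(timeit.repeat(
    lambda: s.encode('ascii', 'backslashreplace'), number=n, repeat=5))

print(f'unicode_escape:   {t_escape / n * 1e6:.3f} µs per call')
print(f'backslashreplace: {t_replace / n * 1e6:.3f} µs per call')
```

The absolute numbers include the overhead of the lambda call and will vary by machine and Python version, so treat them as relative only.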
I doubt that it is a performance bottleneck in your application, but s.encode('unicode_escape') can be faster than s.encode('ascii', 'backslashreplace').
To avoid calling .encode() manually, you could pass the encoding to open():
with open(filename, 'w', encoding='unicode_escape') as file:
    print(s, file=file)
Note: unicode_escape also escapes non-printable ASCII characters and the backslash itself, e.g., a newline is written as \n, a tab as \t, and a backslash as \\.
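The difference is easiest to see side by side. This small sketch uses a made-up sample string containing a tab, a backslash, and a newline, and runs it through both approaches:

```python
# Sample string (made up for illustration): tab, backslash, newline, Cyrillic.
s = 'a\tb\\c\nслово'

# unicode_escape escapes control characters and backslashes too.
escaped = s.encode('unicode_escape').decode('ascii')

# backslashreplace only touches characters that ASCII cannot encode.
replaced = s.encode('ascii', 'backslashreplace').decode('ascii')

print(repr(escaped))
print(repr(replaced))
```

So the two are interchangeable only when the text contains no backslashes or control characters; whether the extra escaping is acceptable depends on what the target format expects.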