More efficient way to make unicode escape codes

I am using python to automatically generate qsf files for Qualtrics online surveys. The qsf file requires unicode characters to be escaped using the \u+hex convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I am achieving this with the following expression:

'слово'.encode('ascii','backslashreplace').decode('ascii')

The output is exactly what I need, but since this is a two-step process, I wondered if there is a more efficient way to get the same result.
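A minimal sketch of that two-step approach wrapped in a reusable helper (the function name is hypothetical, purely for illustration):

def to_escaped(s):
    # Replace each non-ASCII character with its \uXXXX escape,
    # then decode back to str so the result can be handled as text.
    return s.encode('ascii', 'backslashreplace').decode('ascii')

print(to_escaped('слово'))  # -> \u0441\u043b\u043e\u0432\u043e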

asked Dec 04 '25 by reynoldsnlp

2 Answers

If you open your output file in 'wb' mode, then it accepts bytes rather than str (unicode) arguments:

s = 'слово'
with open('data.txt','wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed

This seems to do what you want:

$ cat data.txt
\u0441\u043b\u043e\u0432\u043e

and it avoids both the decode step and any translation that happens when writing unicode to a text stream.
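For the sample string, the one-step bytes result matches the original two-step round trip; a quick check (not part of the original answer) confirms it:

s = 'слово'
one_step = s.encode('unicode_escape')
two_step = s.encode('ascii', 'backslashreplace')

# Both produce the same escaped bytes for this string.
assert one_step == two_step == b'\\u0441\\u043b\\u043e\\u0432\\u043e'

(For strings containing backslashes or control characters the two encodings differ, since unicode_escape escapes those too.)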


Updated to use encode('unicode_escape') as per the suggestion of @J.F.Sebastian.

%timeit reports that it is quite a bit faster than encode('ascii', 'backslashreplace'):

In [18]: f = open('data.txt', 'wb')

In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop

In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop

In [21]: f.close()

Curiously, the slowest-to-fastest-run spread reported by %timeit for encode('unicode_escape') is much larger than that for encode('ascii', 'backslashreplace') even though the per-loop time is faster, so be sure to test both in your environment.
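A standalone way to repeat the comparison outside IPython, if you want to verify on your own machine (the sample string and repeat count are arbitrary):

import timeit

s = 'слово' * 100

for expr in ("s.encode('unicode_escape')",
             "s.encode('ascii', 'backslashreplace')"):
    # Time 100,000 calls of each expression with the same input string.
    t = timeit.timeit(expr, globals={'s': s}, number=100_000)
    print(expr, '->', round(t, 3), 'seconds')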

answered Dec 07 '25 by Neapolitan


I doubt that this is a performance bottleneck in your application, but s.encode('unicode_escape') can be faster than s.encode('ascii', 'backslashreplace').

To avoid calling .encode() manually, you could pass the encoding to open():

with open(filename, 'w', encoding='unicode_escape') as file:
    print(s, file=file)

Note: it escapes non-printable ASCII characters too, e.g., a newline is written as \n, a tab as \t, etc.
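A small illustration of that caveat (a sketch with an arbitrary sample string): control characters are escaped along with the Cyrillic, so a tab or newline in the data ends up as the two characters \t or \n in the output rather than as real whitespace.

s = 'слово\tслово\n'

# The codec escapes the tab and the trailing newline as well:
print(s.encode('unicode_escape'))
# b'\\u0441\\u043b\\u043e\\u0432\\u043e\\t\\u0441\\u043b\\u043e\\u0432\\u043e\\n'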

answered Dec 07 '25 by jfs


