
Read utf-8 character from byte stream

Given a stream of bytes (generator, file, etc.) how can I read a single utf-8 encoded character?

  • This operation must consume the bytes of that character from the stream.
  • This operation must not consume any bytes of the stream that exceed the first character.
  • This operation should succeed on any Unicode character.

I could approach this by rolling my own utf-8 decoding function, but I would prefer not to reinvent the wheel, since this functionality must already exist somewhere to parse utf-8 strings.

arcyqwerty asked Oct 31 '22 04:10



1 Answer

Wrap the stream in an io.TextIOWrapper with encoding='utf-8', then call .read(1) on it.

This assumes you start with a BufferedIOBase or something duck-type compatible with it (i.e. it has a read() method). If you have a generator or iterator instead, you may need to adapt the interface.
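
For the generator case, one possible adapter is sketched here. It assumes the generator yields one byte at a time as an int (0-255); the class name ByteIterReader and the stand-in generator byte_source are hypothetical, and a generator that yields larger chunks would need leftover handling that is not shown.

```python
import io


class ByteIterReader(io.RawIOBase):
    """Hypothetical adapter exposing an iterator of ints (0-255) as a raw stream."""

    def __init__(self, byte_iter):
        super().__init__()
        self._it = iter(byte_iter)

    def readable(self):
        return True

    def readinto(self, buf):
        # Fill buf with as many bytes as the iterator can supply, never
        # pulling more items from the iterator than buf has room for.
        count = 0
        while count < len(buf):
            try:
                buf[count] = next(self._it)
            except StopIteration:
                break
            count += 1
        return count


def byte_source():
    # Stand-in generator (an assumption) yielding one byte at a time as an int.
    yield from "héllo".encode("utf-8")


raw = ByteIterReader(byte_source())
first = raw.read(1)   # the single byte for 'h'
second = raw.read(2)  # the two bytes that encode 'é'
```

Because io.RawIOBase builds read() on top of readinto(), the adapter gains a read() method without defining one, and each call pulls only as many items from the generator as it returns.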

Example, using a file opened in binary mode:

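from io import TextIOWrapper

with open('/path/to/file', 'rb') as f:
    wf = TextIOWrapper(f, encoding='utf-8')
    # Implementation detail, may not work everywhere. It limits read-ahead
    # so the wrapper does not consume bytes past the character it returns.
    wf._CHUNK_SIZE = 1

    char = wf.read(1)      # next utf-8 encoded character, as a str
    next_byte = f.read(1)  # next byte after that character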
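
Since _CHUNK_SIZE is an implementation detail, a more portable approach may be to feed bytes one at a time to an incremental decoder from the standard codecs module. The following is a minimal sketch, not a definitive implementation: the helper name read_utf8_char is hypothetical, and it assumes the stream has a read() method that returns bytes.

```python
import codecs
import io


def read_utf8_char(stream):
    """Read one UTF-8 character from a binary stream; return '' at end of stream."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: final=True raises UnicodeDecodeError if a
            # character was cut off partway through.
            return decoder.decode(b"", final=True)
        char = decoder.decode(byte)
        if char:
            # The decoder emits text only once a full character has arrived,
            # so no byte beyond this character has been read.
            return char


stream = io.BytesIO("é€😀a".encode("utf-8"))
chars = [read_utf8_char(stream) for _ in range(3)]  # 2-, 3- and 4-byte characters
rest = stream.read()  # bytes after the third character are still unread
```

Invalid input raises UnicodeDecodeError, because the decoder is created with the default strict error handling. Reading one byte per call is slow on unbuffered sources, but it is what keeps the stream position exactly at the end of the character.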
Kevin answered Nov 29 '22 04:11
