 

Opening memory-mapped file with encoding

A memory-mapped file is an efficient way to run regexes over, or otherwise manipulate, large binary files.

In case I have a large text file (~1GB), is it possible to work with an encoding-aware mapped file?
A regex like [\u1234-\u5678] won't work on bytes objects, and encoding the pattern doesn't help either ("[\u1234-\u5678]".encode("utf-32"), for example, does not produce a pattern that expresses the range correctly).
Searching might work if I convert the search pattern from str to bytes using .encode(), but that is still rather limited, and there should be a simpler way than decoding and encoding all day.

I tried wrapping the mmap in an io.BufferedRandom and then in an io.TextIOWrapper, but to no avail:

AttributeError: 'mmap.mmap' object has no attribute 'seekable'

Creating a wrapper (using inheritance) and setting the methods seekable, readable and writable to return True did not work either.

Regarding encoding, a fixed-length encoding may be assumed: UTF-32 (raw code points), or just the BMP portion of UTF-16 (if restricting the data to that part is even possible).

A solution for any Python version is welcome.

Bharel asked Nov 08 '22 17:11



1 Answer

You can't do this without essentially reinventing the wheel (writing all-new versions of the re module, the mmap module, etc.), or writing extraordinarily complex regexes that can't use niceties like true Unicode character ranges. To match [\u1234-\u5678] in big-endian UTF-16 data, for instance, you'd need an alternation of three different byte patterns, something like (?:\x12[\x34-\xff]|[\x13-\x55].|\x56[\x00-\x78]), compiled with re.DOTALL so that . also matches the byte 0x0A.

Basically, re patterns work either with str or with bytes-like objects, and you can't work around that with memoryviews and casts, because re still treats the data as individual bytes, not as larger units.
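A small sketch of that restriction, using an anonymous mmap as a stand-in for a real mapped file (the sample data and variable names are only illustrative):

```python
import mmap
import re

data = "x\u1234y".encode("utf-16-be")

# Anonymous mapping used as a stand-in for a memory-mapped file.
mm = mmap.mmap(-1, len(data))
mm[:] = data

# A str pattern is rejected outright on a bytes-like object.
str_pattern_rejected = False
try:
    re.search("[\u1234-\u5678]", mm)
except TypeError:
    str_pattern_rejected = True

# A bytes pattern is accepted, but it matches raw bytes and
# reports byte offsets, not character positions.
match = re.search(rb"\x12\x34", mm)
byte_offset = match.start()

mm.close()
print(str_pattern_rejected, byte_offset)
```

The offset returned is in bytes, so mapping it back to a character index is left to the caller.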

For simple searches, you could try mmap.find after encoding the search string, but that's still prone to subtle bugs. For UCS-2 or UTF-32, you'd need to check that the offset returned by find is aligned on a two- or four-byte boundary respectively, to ensure you didn't mistake the end of one character plus the beginning of the next for a completely different character. If the alignment test fails, you have to repeat the search with a start offset of the last return value + 1, until you either get an aligned hit or find returns -1. It's just not a reasonable thing to do in the general case.
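The retry loop described above could look roughly like this. It is only a sketch: find_aligned is a hypothetical helper, the anonymous mmap stands in for a real mapped file, and the data is assumed to be a fixed-width encoding with no BOM.

```python
import mmap


def find_aligned(buf, needle, width, start=0):
    """Return the first offset >= start where `needle` begins on a
    `width`-byte boundary, or -1 if there is none.

    `buf` is anything with a bytes-style find(), e.g. bytes or mmap.
    """
    pos = buf.find(needle, start)
    while pos != -1 and pos % width != 0:
        # Misaligned hit: resume one byte past it.
        pos = buf.find(needle, pos + 1)
    return pos


text = "\u0100\u4100A"
data = text.encode("utf-16-be")   # bytes: 01 00 41 00 00 41

mm = mmap.mmap(-1, len(data))     # stand-in for a mapped file
mm[:] = data

needle = "A".encode("utf-16-be")  # bytes: 00 41
naive = mm.find(needle, 0)        # false hit spanning two characters
aligned = find_aligned(mm, needle, 2)
mm.close()

print(naive, aligned, aligned // 2)
```

Here the naive search lands on the tail of U+0100 followed by the head of U+4100, while the aligned search finds the real "A" at character index 2.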

ShadowRanger answered Nov 14 '22 21:11
