Given a valid text file and its java.nio.charset.Charset, how can I efficiently (preferably using RandomAccessFile.seek() or InputStream.skip(), without reading the whole file) split it into two or more chunks while ensuring that no chunk contains a partial code point (it would be nice not to split a character/grapheme either, but that's probably too hard)?
For fixed-length encodings the answer is trivial: split at aligned positions. However, I'm not sure whether CharsetEncoder.averageBytesPerChar() == CharsetEncoder.maxBytesPerChar() (these methods live on CharsetEncoder, not CharsetDecoder) is a proper indication of a fixed-length encoding, so it would be nice to find a reliable test.
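Here is the heuristic I mean, as a minimal sketch (looksFixedWidth is just my own name for it; the equality is only a hint, since nothing in the Charset contract guarantees it, and the figures count bytes per Java char rather than per code point):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class FixedWidthHeuristic {
    // Heuristic only: equal average and maximum bytes-per-char suggests a
    // fixed-width encoding, but the Charset API does not guarantee it, and
    // it measures bytes per Java char, not per code point (UTF-16BE reports
    // 2.0 == 2.0 even though a code point may span two chars).
    static boolean looksFixedWidth(Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        return enc.averageBytesPerChar() == enc.maxBytesPerChar();
    }

    public static void main(String[] args) {
        System.out.println(looksFixedWidth(StandardCharsets.US_ASCII)); // true
        System.out.println(looksFixedWidth(StandardCharsets.UTF_8));    // false
    }
}
```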
Some variable-length encodings are not self-synchronizing; how can I find a character or code point boundary in the byte stream after seeking to a split position?
The objective is to split the file for parallel processing: classic IO cannot keep all CPU cores busy while reading a file sequentially.
What you want is completely impossible.
Charset is a broad abstraction. The intent is that as many imaginable charset encodings as possible can be represented by it. This is a general principle in programming: the more 'handcuffs' you put on an interface, the easier it is to work with and the more you can do with it, but the fewer systems actually fit the abstraction.
Perhaps instead you wanted to ask a slightly different question: "How can I do this for a specific charset whose byte structure I know, such as UTF-8?"

Or perhaps: "How can I do this for the charsets in common use, such as the UTF family?"
For the UTF family it is quite easy. But note that UTF-8 is generally praised as a magnificent invention precisely because it has the property that the thing you want to do is in fact possible. Simpler takes on charset design result in encodings where the job you want is literally impossible.
Imagine the following charset encoding for unicode:

1. One byte: charSize, the size in bytes of every character in the upcoming run.
2. One byte: runLen, the number of characters in the run.
3. runLen * charSize bytes: the characters themselves.

This encoding is plausible and could trivially be turned into a complete java Charset definition.
And yet, the thing you want is not possible. In a random access system you can still get through it faster than literally streaming through it (once you have read #1 and #2 you can simply skip over the runLen * charSize part if you don't need to split inside it), but you can't just read some bytes in the middle and know where to chop. You must start from the first byte, and there is no method on Charset that would allow charset-agnostic code to chop such a file up efficiently. A 'chopper' for this hypothetical format CAN exist and can be fairly efficient, but it would have to be written specifically for this exact charset encoding.
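To illustrate, here is a rough sketch (invented helper name, hypothetical format as numbered above) of what such a dedicated chopper could look like; note that it must visit every run header in order from byte 0, even though seek() lets it skip each payload:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class HypotheticalChopper {
    // Returns the first run boundary at or after 'target' in the hypothetical
    // [charSize][runLen][runLen * charSize bytes] format described above.
    // There is no way to resynchronize mid-stream, so every run header must
    // be read in order from byte 0; seek() merely skips the payloads.
    static long firstBoundaryAtOrAfter(RandomAccessFile file, long target) throws IOException {
        long pos = 0;
        file.seek(0);
        while (pos < target && pos < file.length()) {
            int charSize = file.readUnsignedByte(); // header byte #1
            int runLen = file.readUnsignedByte();   // header byte #2
            pos += 2 + (long) runLen * charSize;    // jump past the payload
            file.seek(pos);
        }
        return pos; // a position where it is safe to cut
    }
}
```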
QED: An algorithm that can efficiently chop any input given only 'random access stream of bytes' and '1 Charset impl' is not possible.
UTF-8 has the convenient property that any byte starting with bits 10 is a continuation byte, and any byte that doesn't start with 10 begins a character and may be followed by continuation bytes. You in fact know how many continuation bytes follow based only on that first byte:

- 0xxxxxxx: a complete (ASCII) character on its own, 0 continuation bytes
- 110xxxxx: 1 continuation byte follows
- 1110xxxx: 2 continuation bytes follow
- 11110xxx: 3 continuation bytes follow
- 10xxxxxx: a continuation byte, never the start of a character
Therefore, to split a file in the middle, simply seek to the desired position, read a byte, and keep reading until you hit a byte that does not start with bits 10 (i.e. (b & 0xC0) != 0x80) - that is the start of a whole new character. Include all bytes before this point in 'the left chunk' and the not-10 byte plus all that remains in the 'right chunk'.
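As a minimal sketch (nextBoundary is my own name, not a library method), assuming the file really is valid UTF-8:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class Utf8Splitter {
    // Returns the position of the first UTF-8 code point boundary at or after
    // 'target'. Continuation bytes match 10xxxxxx, i.e. (b & 0xC0) == 0x80;
    // the first byte that does NOT match starts a fresh code point.
    static long nextBoundary(RandomAccessFile file, long target) throws IOException {
        file.seek(target);
        long pos = target;
        while (pos < file.length()) {
            int b = file.readUnsignedByte();
            if ((b & 0xC0) != 0x80) {
                return pos; // start of a new code point
            }
            pos++;
        }
        return pos; // reached end of file; the boundary is EOF itself
    }
}
```

Since a UTF-8 character is at most 4 bytes long, this scans at most 3 continuation bytes past the requested position.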
Unicode is a lot more complicated than this. UTF-8 can be trivially lopped into chunks with fast random-access performance, but a sequence of complete code points can nevertheless belong together. For example, this sequence of unicode values:

U+0065 U+0301

is the symbol é. 2 unicode values (not UTF-8 bytes; no, full unicode values - in UTF-8 terms, 2 bytes that do not start with 0b10) - one symbol. That's the plain jane ascii e plus the unicode combining mark "put ´ on the previous symbol".
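You can demonstrate this in plain Java; java.text.Normalizer can compose the pair in decoded text, but a splitter working on raw bytes sees only two separate code points:

```java
import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // U+0065 'e' + U+0301 combining acute accent
        String composed = "\u00E9";    // U+00E9 'é' as one code point

        System.out.println(decomposed.length());         // 2
        System.out.println(composed.length());           // 1
        System.out.println(decomposed.equals(composed)); // false, yet both render as é

        // NFC normalization composes the pair into the single code point:
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));        // true
    }
}
```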
Similar shenanigans occur with emoji: you can easily hook together 7 or more code points (each of which is itself a surrogate pair in java), producing strings whose .length() returns 14 or more and yet render as 1 single glyph. Flag emoji work like this, and there are modifiers: you can have 'hug' + 'man' + 'man' + 'brown' + 'olive' or whatever, to indicate the genders and skin colours of who is doing the hugging.
Another source of problems is directional indicators (there's a unicode character that means "... and now the text goes right-to-left").
If you chop a text file in twain, even if you do it right and use the 0b10 trick to ensure you don't chop right through a single unicode value, you can still end up with one file ending in "e" and the next file starting with "´", whereas the source simply had a "é".
You should think about this. If it's important that you don't chop emoji modifiers and/or decomposed chars and/or directional modifiers, hoo boy: this boils down to extremely convoluted code. Think 'man-month of work' levels of complicated.
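As a partial mitigation once a chunk has been decoded, java.text.BreakIterator can at least keep base characters together with their combining marks (a sketch using an invented helper name; how well the character instance handles emoji ZWJ sequences depends on the JDK version):

```java
import java.text.BreakIterator;

public class GraphemeSafeSplit {
    // Given already-decoded text and a desired split index (in chars), returns
    // the nearest grapheme boundary at or before it, so a base character is
    // never separated from its combining marks.
    static int safeSplitIndex(String text, int desiredIndex) {
        if (desiredIndex >= text.length()) return text.length();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int boundary = it.preceding(desiredIndex + 1); // last boundary at or before desiredIndex
        return boundary == BreakIterator.DONE ? 0 : boundary;
    }

    public static void main(String[] args) {
        String s = "abce\u0301f"; // 'e' + combining acute between 'c' and 'f'
        System.out.println(safeSplitIndex(s, 4)); // 3: refuses to split e from its accent
    }
}
```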
Remember: Unicode is more complicated than you think it is.