I'm using a BufferedReader to read a byte stream (UTF-8 text) line by line. For a specific reason, I need to know where exactly in the byte stream the line starts.
The problem: I can't use the position of the InputStream I plug into the BufferedReader, because the reader buffers and reads more than one line at a time.
My question: How can I determine the precise byte offset of each line read?
One apparent (but incorrect) solution would be to use (line + "\n").getBytes("UTF-8").length. There are two problems with this approach: 1) converting the string back to bytes just to count them is quite an overhead, and 2) a newline is not always marked with "\n"; it might also be "\r\n", etc.
Is there any other solution to this?
EDIT: Every LineReader-like class I've seen so far seems to be buffered. Does anyone know of an unbuffered LineReader-like class?
Just read the file as raw bytes. A newline in UTF-8 will always be either the bytes 13 and 10 (\r\n), 13 alone (\r), or 10 alone (\n). If the files are going to have different EOL conventions, though, that's exactly the same problem you would have if you read the file as a string.
The raw byte equivalent of BufferedReader is BufferedInputStream.
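A rough sketch of that raw-byte approach, in case it helps. The class and method names here (OffsetLineReader, Line, readLines) are made up for illustration and are not part of the JDK. It assumes the input is valid UTF-8 and treats \n, \r and \r\n as line terminators, the same way BufferedReader.readLine() does. Scanning byte by byte is safe because the values 10 and 13 never occur inside a multi-byte UTF-8 sequence.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a JDK class): reads UTF-8 lines from a byte
// stream and records the byte offset at which each line starts.
final class OffsetLineReader {

    /** A decoded line plus the offset of its first byte in the stream. */
    static final class Line {
        final long offset;
        final String text;

        Line(long offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    static List<Line> readLines(InputStream in) throws IOException {
        InputStream src = new BufferedInputStream(in);
        List<Line> lines = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        long pos = 0;              // number of bytes consumed so far
        long lineStart = 0;        // offset of the current line's first byte
        boolean lastWasCr = false; // true right after a '\r' terminator
        int b;
        while ((b = src.read()) != -1) {
            pos++;
            if (b == '\n' && lastWasCr) {
                // Second half of "\r\n": the line was already emitted at '\r',
                // so only move the start of the next line past this byte.
                lastWasCr = false;
                lineStart = pos;
                continue;
            }
            lastWasCr = false;
            if (b == '\r' || b == '\n') {
                lines.add(new Line(lineStart,
                        new String(buf.toByteArray(), StandardCharsets.UTF_8)));
                buf.reset();
                lineStart = pos;
                lastWasCr = (b == '\r');
            } else {
                buf.write(b);
            }
        }
        if (buf.size() > 0) {
            // Final line without a trailing terminator.
            lines.add(new Line(lineStart,
                    new String(buf.toByteArray(), StandardCharsets.UTF_8)));
        }
        return lines;
    }
}
```

For large files you would probably hand each line to a callback instead of collecting a list, but the offset bookkeeping stays the same.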
You can also count UTF-8 bytes of a string without encoding:
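    public static int byteCountUTF8(String input) {
        int ret = 0;
        for (int i = 0; i < input.length(); ++i) {
            // Read a full code point; this may span two chars (a surrogate pair).
            int cc = Character.codePointAt(input, i);
            if (cc <= 0x7F) {
                ret++;          // ASCII: 1 byte
            } else if (cc <= 0x7FF) {
                ret += 2;       // 2-byte sequence
            } else if (cc <= 0xFFFF) {
                ret += 3;       // rest of the Basic Multilingual Plane: 3 bytes
            } else if (cc <= 0x10FFFF) {
                ret += 4;       // supplementary code point: 4 bytes
                i++;            // skip the low surrogate of the pair
            }
        }
        return ret;
    }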