
BufferedReader: Determine byte offset of lines read

I'm using a BufferedReader to read a byte stream (UTF-8 text) line by line. For a specific reason, I need to know where exactly in the byte stream the line starts.

The problem: I can't use the position of the InputStream I plug into the BufferedReader, because the reader buffers and reads more than one line in one go.

My question: How can I determine the precise byte offset of each line read?

One apparent (but incorrect) solution would be to use (line + "\n").getBytes("UTF-8").length. There are two problems with this approach: 1) converting the string back to bytes is quite an overhead just to count them, and 2) a newline is not always marked with "\n"; it might also be "\r\n", etc.

Is there any other solution to this?

EDIT: Every LineReader-like class I've seen so far seems to be buffered. Does anyone know of an unbuffered LineReader-like class?

Johannes asked Nov 12 '22 13:11



1 Answer

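Just read the file as raw bytes. In UTF-8 a line break is always byte 13 followed by 10 (CR LF), 13 alone (CR) or 10 alone (LF), and these byte values never occur inside a multi-byte sequence. If the files are going to have different EOL conventions you still have to handle all three, but that's exactly the same problem you would have if you read the file as a string.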

The raw byte equivalent of BufferedReader is BufferedInputStream.
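
As a rough sketch of that raw-byte approach (not a drop-in library class): the code below wraps the stream in a BufferedInputStream, counts every byte it consumes, and records the offset at which each line begins. The names LineOffsets, Line and readLines are made up for illustration, and it assumes the input is valid UTF-8 and that each line fits in memory.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    /** Hypothetical holder: a decoded line plus the byte offset where it starts. */
    static final class Line {
        public final long offset;
        public final String text;

        Line(long offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Splits the stream on LF, CR or CR LF and remembers where each line began. */
    static List<Line> readLines(InputStream raw) throws IOException {
        InputStream in = new BufferedInputStream(raw);
        List<Line> lines = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        long pos = 0;        // number of bytes consumed so far
        long lineStart = 0;  // offset of the first byte of the current line
        boolean lastWasCr = false;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (b == '\n' && lastWasCr) {
                // Second half of CR LF: the line was already emitted at the CR.
                lastWasCr = false;
                lineStart = pos;
                continue;
            }
            lastWasCr = (b == '\r');
            if (b == '\n' || b == '\r') {
                lines.add(new Line(lineStart, new String(buf.toByteArray(), StandardCharsets.UTF_8)));
                buf.reset();
                lineStart = pos;
            } else {
                buf.write(b);
            }
        }
        if (buf.size() > 0) {
            // Last line without a trailing line terminator.
            lines.add(new Line(lineStart, new String(buf.toByteArray(), StandardCharsets.UTF_8)));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first\r\nsecond\nthird".getBytes(StandardCharsets.UTF_8);
        for (Line line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line.offset + ": " + line.text);
        }
    }
}
```

Decoding happens once per line, after the split, so the offsets are exact byte positions regardless of how many bytes each character takes.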

You can also count the UTF-8 bytes of a string without encoding it:

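/** Counts the bytes the string would occupy in UTF-8 without actually encoding it. */
public static int byteCountUTF8(String input) {
    int ret = 0;
    for (int i = 0; i < input.length(); ++i) {
        // Read a full code point so that a surrogate pair counts as one character.
        int cc = Character.codePointAt(input, i);
        if (cc <= 0x7F) {
            ret++;          // ASCII: 1 byte
        } else if (cc <= 0x7FF) {
            ret += 2;
        } else if (cc <= 0xFFFF) {
            ret += 3;
        } else if (cc <= 0x10FFFF) {
            ret += 4;       // supplementary plane: 4 bytes
            i++;            // skip the low surrogate of the pair
        }
    }
    return ret;
}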
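
If you want to convince yourself the thresholds are right, here is a small cross-check, assuming nothing beyond the JDK. It uses a compact variant of the method above (the names Utf8LengthCheck and utf8Length are just for this example) and compares its result with the length of the actually encoded bytes. Strings containing unpaired surrogates are deliberately left out, since the encoder may substitute them and the two numbers could then differ.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthCheck {
    /** Same thresholds as the method above, stepping by whole code points. */
    static int utf8Length(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            total += cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
            i += Character.charCount(cp); // 2 chars for a surrogate pair, else 1
        }
        return total;
    }

    public static void main(String[] args) {
        // ASCII, a 2-byte character, a 3-byte character and a 4-byte character.
        String[] samples = {"plain ascii", "caf\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            int counted = utf8Length(s);
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(counted + " counted, " + encoded + " encoded");
        }
    }
}
```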
Esailija answered Nov 15 '22 06:11
