BufferedReader: Determine byte offset of lines read

Question

I'm using a BufferedReader to read a byte stream (UTF-8 text) line by line. For a specific reason, I need to know where exactly in the byte stream the line starts.

The problem: I can't use the position of the InputStream I plug into the BufferedReader because the reader buffers and reads more than one line in one go.

My question: How can I determine the precise byte offset of each line read?

One apparent (but incorrect) solution would be to use (line + "\n").getBytes("UTF-8").length. There are two problems with this approach: 1) converting the string back to bytes is quite an overhead just to count them, and 2) a newline is not always marked with "\n"; it might also be "\r\n", etc.

Is there any other solution to this?

EDIT: Every LineReader-like class I've seen so far seems to be buffered. Does anyone know of an unbuffered LineReader-like class?

Esailija · Accepted Answer

Just read the file as raw bytes. A newline in UTF-8 is always byte 13 followed by 10 (CRLF), 13 alone (CR), or 10 alone (LF). If the files use different EOL conventions, though, you face exactly the same problem you would have when reading the file as a string.

The raw-byte equivalent of BufferedReader is BufferedInputStream.
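As a rough illustration of the raw-byte approach, here is a minimal sketch that scans a BufferedInputStream one byte at a time and records the offset at which each line starts. The names LineOffsets, OffsetLine and readLines are made up for this example, and it assumes the input is valid UTF-8 and that CRLF, CR and LF should all count as line terminators (the same convention BufferedReader.readLine uses).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK.
final class LineOffsets {

    /** A decoded line together with the byte offset at which it starts. */
    static final class OffsetLine {
        final long offset;
        final String text;

        OffsetLine(long offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    static List<OffsetLine> readLines(InputStream raw) throws IOException {
        List<OffsetLine> lines = new ArrayList<>();
        InputStream in = new BufferedInputStream(raw);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        long pos = 0;        // number of bytes consumed so far
        long lineStart = 0;  // byte offset of the line currently being collected
        boolean lastWasCR = false;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (lastWasCR) {
                lastWasCR = false;
                if (b == '\n') {
                    // LF completing a CRLF pair: the next line starts after it
                    lineStart = pos;
                    continue;
                }
            }
            if (b == '\r' || b == '\n') {
                // Bytes 13 and 10 never occur inside a multi-byte UTF-8
                // sequence, so splitting on them at byte level is safe.
                lines.add(new OffsetLine(lineStart,
                        new String(buf.toByteArray(), StandardCharsets.UTF_8)));
                buf.reset();
                lineStart = pos;
                lastWasCR = (b == '\r');
            } else {
                buf.write(b);
            }
        }
        if (buf.size() > 0) {
            // Final line without a trailing terminator
            lines.add(new OffsetLine(lineStart,
                    new String(buf.toByteArray(), StandardCharsets.UTF_8)));
        }
        return lines;
    }
}
```

Decoding only happens once per line, after the terminator has been found, so the offsets always refer to positions in the original byte stream rather than to character counts.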

You can also count UTF-8 bytes of a string without encoding:

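// Counts the bytes the string would occupy in UTF-8 without encoding it.
public static int byteCountUTF8(String input) {
    int ret = 0;
    for (int i = 0; i < input.length(); ++i) {
        // Reads a full code point, combining a surrogate pair if one starts at i
        int cc = Character.codePointAt(input, i);
        if (cc <= 0x7F) {
            ret++;          // ASCII: 1 byte
        } else if (cc <= 0x7FF) {
            ret += 2;       // 2-byte sequence
        } else if (cc <= 0xFFFF) {
            ret += 3;       // rest of the Basic Multilingual Plane: 3 bytes
        } else if (cc <= 0x10FFFF) {
            ret += 4;       // supplementary code point: 4 bytes
            i++;            // skip the low surrogate of the pair
        }
    }
    return ret;
}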
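One way to gain confidence in the counting method is to compare it with what the encoder actually produces. The sketch below repeats the method so that it compiles on its own; the class name Utf8CountCheck and the helper matchesEncoder are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical test harness for the counting method shown above.
final class Utf8CountCheck {

    // Copy of byteCountUTF8 so this sketch is self-contained
    static int byteCountUTF8(String input) {
        int ret = 0;
        for (int i = 0; i < input.length(); ++i) {
            int cc = Character.codePointAt(input, i);
            if (cc <= 0x7F) {
                ret++;
            } else if (cc <= 0x7FF) {
                ret += 2;
            } else if (cc <= 0xFFFF) {
                ret += 3;
            } else if (cc <= 0x10FFFF) {
                ret += 4;
                i++; // skip the low surrogate
            }
        }
        return ret;
    }

    // True when the computed count equals the length of the encoded bytes
    static boolean matchesEncoder(String s) {
        return byteCountUTF8(s) == s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "plain ASCII",
            "na\u00efve",          // contains a 2-byte character
            "\u20ac100",           // euro sign, 3 bytes
            "\uD83D\uDE00 emoji"   // surrogate pair, 4 bytes
        };
        for (String s : samples) {
            System.out.println(byteCountUTF8(s) + " bytes, matches encoder: "
                    + matchesEncoder(s));
        }
    }
}
```

The two counts should agree for well-formed strings. They can differ for strings containing unpaired surrogates, because getBytes substitutes a replacement byte there while the method above counts three bytes.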


Tags:

java

utf-8

bufferedreader

Johannes
