I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes. An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library. The solution must be scalable. Reading the entire file into memory is not a solution. Returning to a position should not require reading all prior characters in linear time.
For the first requirement, BufferedReader.readLine()
is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader
also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine()
reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader
from a FileReader
and keep an instance of the FileReader
accessible to your code, you should be able to get the position of the next line by calling:
fileReader.getChannel().position();
after a call to bufferedReader.readLine()
.
The BufferedReader
could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
Alternate Solution What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or starting position if this file has been previously processed while (readingLines) { String line = bufferedReader.readLine(); startingPoint += line.getBytes().length; }
this would give you the byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since they are stripped.
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile
. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8( RandomAccessFile in ) throws IOException { String rv = null; String lineBytes = in.readLine(); if ( null != lineBytes ) { rv = new String( lineBytes.getBytes(), StandardCharsets.UTF_8 ); } return rv; }
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in
:
long startPos = in.getFilePointer(); String line = readNextLineAsUTF8( in );
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With