Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.

At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.

Questions

Is there an easy way to do both, preferably using standard Java libraries?

If not, what is a reasonable workaround?

Attributes of an ideal solution

  • An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
  • An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library.
  • The solution must be scalable. Reading the entire file into memory is not a solution. Returning to a position should not require reading all prior characters in linear time.

Details

For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.

Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:

To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.

The method RandomAccessFile.readLine() treats each byte as one character, so it cannot decode multi-byte encodings. From its documentation:

Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

Andy Thomas asked Jun 03 '15 18:06


2 Answers

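If you construct a BufferedReader over an InputStreamReader that wraps a FileInputStream, and keep the FileInputStream accessible to your code, you can query the underlying file position by calling:

fileInputStream.getChannel().position();

after a call to bufferedReader.readLine(). (FileReader itself has no getChannel() method; FileInputStream does.)

Note that this position is how far the underlying stream has been read, which can be past the end of the line just returned. The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance for positional precision, although the InputStreamReader may still read ahead.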
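To see what the channel position actually reports, a small probe can help. This is a minimal sketch, not a solution: the class and method names are hypothetical, and it uses a FileInputStream because that class exposes getChannel().

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class ChannelPositionProbe {
    /**
     * Reads one line through a BufferedReader and returns the position that
     * the underlying FileInputStream's channel reports afterwards.
     */
    static long channelPositionAfterFirstLine(Path file) throws IOException {
        try (FileInputStream stream = new FileInputStream(file.toFile());
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            reader.readLine();
            // Bytes consumed from the file so far, including any read-ahead
            // done by the decoder or the buffer.
            return stream.getChannel().position();
        }
    }
}
```

If the returned value is larger than the byte length of the first line plus its terminator, the readers have read ahead, and the value cannot be used as the start of the next line.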

Alternate Solution

What would be wrong with keeping track of the bytes yourself?

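    long startingPoint = 0; // or the saved position if this file has been previously processed
    String line;
    // readLine() returns null at end of file, so test for it before using the line.
    while ((line = bufferedReader.readLine()) != null) {
        // Count bytes in the file's encoding rather than the platform default.
        // "charset" is assumed to be the Charset the reader was opened with.
        startingPoint += line.getBytes(charset).length;
    }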

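This gives you a byte count for what you've already processed, regardless of underlying marking or buffering, provided the bytes are counted in the file's own encoding. You'd have to account for line endings in your tally, since readLine() strips them and does not report whether a line ended in LF, CR, or CR LF.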
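The tally, with line endings included, might look like the sketch below. It assumes every line (including the last) ends with one known terminator and that the encoding writes no byte-order mark. The class name, method name, and terminator parameter are illustrative, not part of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class LineStartTally {
    /**
     * Returns the byte offset at which each line starts, assuming a single
     * known terminator after every line and no byte-order mark.
     */
    static List<Long> lineStarts(BufferedReader reader, Charset charset, String terminator)
            throws IOException {
        long terminatorBytes = terminator.getBytes(charset).length;
        List<Long> starts = new ArrayList<>();
        long position = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            starts.add(position);
            // Encoded length of the line plus the terminator that readLine() stripped.
            position += line.getBytes(charset).length + terminatorBytes;
        }
        return starts;
    }
}
```

This still reads the file once from the start to build the offsets, and it breaks down for files with mixed line endings. Charsets whose encoder emits a byte-order mark on every getBytes() call (such as StandardCharsets.UTF_16) would also inflate the count.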

Jeff answered Sep 22 '22 14:09


This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).

In UTF-8:

  • All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
  • All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.

Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
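As a rough illustration of that claim, the sketch below splits raw UTF-8 bytes on the LF byte and only then decodes each piece. The class and method names are made up for this example, and it handles LF terminators only (a preceding CR would be left in the decoded line).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class Utf8ByteLines {
    /** Splits UTF-8 bytes on LF (0x0A) and decodes each piece separately. */
    static List<String> splitThenDecode(byte[] utf8) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Safe in UTF-8: 0x0A never occurs inside a multi-byte sequence.
            if (utf8[i] == '\n') {
                lines.add(new String(utf8, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Final line without a trailing LF, if any.
        if (start < utf8.length) {
            lines.add(new String(utf8, start, utf8.length - start, StandardCharsets.UTF_8));
        }
        return lines;
    }
}
```

The same approach would not be safe for an encoding such as UTF-16, where the byte 0x0A can appear as one half of an unrelated character.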

To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.

Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.

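    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    protected static String readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
        String rv = null;
        // readLine() maps each byte to one char, with the high eight bits set to zero.
        String lineBytes = in.readLine();
        if ( null != lineBytes ) {
            // Recover the raw bytes with ISO-8859-1 (not the platform default),
            // then decode them as UTF-8.
            rv = new String( lineBytes.getBytes( StandardCharsets.ISO_8859_1 ),
                StandardCharsets.UTF_8 );
        }
        return rv;
    }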

Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:

    long startPos = in.getFilePointer();
    String line = readNextLineAsUTF8( in );
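
To show the save-and-return round trip end to end, here is a minimal sketch under the same UTF-8 assumption. The class name and the positionAfterLines and lineAt helpers are hypothetical. Its readNextLineAsUTF8 recovers the raw bytes through ISO-8859-1, which maps chars 0 to 255 back to the same byte values that RandomAccessFile.readLine() produced.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class SeekToLineDemo {
    static String readNextLineAsUTF8(RandomAccessFile in) throws IOException {
        String raw = in.readLine(); // one char per byte, high bits zero
        return raw == null ? null
            : new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    /** Consumes `skip` lines and returns the file position of the line after them. */
    static long positionAfterLines(Path file, int skip) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            for (int i = 0; i < skip; i++) {
                in.readLine();
            }
            return in.getFilePointer();
        }
    }

    /** Re-opens the file, seeks directly to a saved position, and reads one line. */
    static String lineAt(Path file, long position) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(position);
            return readNextLineAsUTF8(in);
        }
    }
}
```

A saved position can be stored anywhere (a database, a sidecar file) and passed to lineAt later without re-reading the earlier lines. RandomAccessFile is unbuffered, so reading a large file line by line this way may be slow.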

Andy Thomas answered Sep 21 '22 14:09