Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to deal with a very large text file?

I'm currently writing something that needs to handle very large text files (a few GiB at least). What's needed here (and this is fixed) is:

  • CSV-based, following RFC 4180 with the exception of embedded line breaks
  • random read access to lines, though mostly line by line and near the end
  • appending lines at the end
  • (changing lines). Obviously that calls for the rest of the file to be rewritten, it's also rare, so not particularly important at the moment

The size of the file forbids keeping it completely in memory (which is also not desirable, since when appending the changes should be persisted as soon as possible).

I have thought of using a memory-mapped region as a window into the file which gets moved around if a line outside its range is requested. Of course, at that stage I still have no abstraction above the byte level. To actually work with the contents I have a CharsetDecoder giving me a CharBuffer. Now the problem is, I can deal with lines of text probably just fine in the CharBuffer, but I also need to know the byte offset of that line within the file (to keep a cache of line indexes and offsets so I don't have to scan the file from the beginning again to find a specific line).

Is there a way to map the offsets in a CharBuffer to offsets in the matching ByteBuffer at all? It's obviously trivial with ASCII or ISO-8859-*, less so with UTF-8 and with ISO 2022 or BOCU-1 things would get downright ugly (not that I actually expect the latter two, but UTF-8 should be the default here – and still poses problems).

I guess I could just convert a portion of the CharBuffer to bytes again and use the length. Either it works or I get problems with diacritics in which case I could probably mandate the use of NFC or NFD to assure that the text is always unambiguously encoded.

Still, I wonder if that is even the way to go here. Are there better options?

ETA: Some replies to common questions and suggestions here:

This is a data storage for simulation runs, intended to be a small-ish local alternative to a full-blown database. We do have database backends as well and they are used, but for cases where they are unavailable or not applicable we do want this.

I'm also only supporting a subset of CSV (without embedded line breaks), but that's ok for now. The problematic points here are pretty much that I cannot predict how long the lines are and thus need to create a rough map of the file.

As for what I outlined above: The problem I was pondering was that I can easily determine the end of a line on the character level (U+000D + U+000A), but I didn't want to assume that this looks like 0A 0D on the byte level (which already fails for UTF-16, for example, where it's either 0D 00 0A 00 or 00 0D 00 0A). My thoughts were that I could make the character encoding changable by not hard-coding details of the encoding I currently use. But I guess I could just stick to UTF-8 and ingore everything else. Feels wrong, somehow, though.

like image 287
Joey Avatar asked Jan 18 '11 10:01

Joey


2 Answers

It's very difficult to maintain a 1:1 mapping between a sequence of Java chars (which are effectively UTF-16) and bytes which could be anything depending on your file encoding. Even with UTF-8, the "obvious" mapping of 1 byte to 1 char only works for ASCII. Neither UTF-16 nor UTF-8 guarantees that a unicode character can be stored in a single machine char or byte.

I would maintain my window into the file as a byte buffer, not a char buffer. Then to find line endings in the byte buffer, I'd encode the Java string "\r\n" (or possibly just "\n") as a byte sequence using the same encoding as the file is in. I'd then use that byte sequence to search for line endings in the byte buffer. The position of a line ending in the buffer + the offset of the buffer from the start of the file maps exactly to the byte position in the file of the line ending.

Appending lines is just a case of seeking to the end of the file and adding your new lines. Changing lines is more tricky. I think I would maintain a list or map of byte positions of changed lines and what the change is. When ready to write the changes:

  1. sort the list of changes by byte position
  2. read the original file up to the next change and write it to a temporary file.
  3. write the changed line to the temporary file.
  4. skip the changed line in the original file.
  5. go back to step 2 unless you have reached the end of the original file
  6. move the temp file over the original file.
like image 70
JeremyP Avatar answered Oct 04 '22 15:10

JeremyP


Would it be possible to split the file in "subfiles" (of course you must not split it within one Utf-8 char)? Then you need some meta data for each of the subfiles (total number of chars, and total number of lines).

If you have this and the "subfiles" are relative small so that you can always load one compleatly then the handling becomes easy.

Even the editing becomes easy to, because you only need to update the "subfile" and its meta data.

If you would put it to the edge: then you could use a database and store one line per data base row. -- If this is a good idea strongly depends on your use case.

like image 28
Ralph Avatar answered Oct 04 '22 16:10

Ralph