Why is ReversedLinesFileReader so slow?

I have a file that is 21.6 GB and I want to read it from the end to the start rather than from the beginning to the end, as you usually would.

If I read each line of the file from the start to the end using the following code, then it takes 1 minute, 12 seconds.

import java.io.File

val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
    val i = 0 // placeholder body: I only want to measure the raw read time
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
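
(timeFormat() is a small custom extension I use to pretty-print elapsed milliseconds; it is not shown here, but a minimal stand-in might look like this:)

fun Long.timeFormat(): String {
    val totalSecs = this / 1000
    // e.g. 72000 ms -> "1 min, 12 sec"
    return "%d min, %d sec".format(totalSecs / 60, totalSecs % 60)
}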

Now, I have read that to read a file in reverse I should use ReversedLinesFileReader from Apache Commons IO. I created the following extension function to do just that:

import org.apache.commons.io.input.ReversedLinesFileReader
import java.nio.charset.Charset

fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    // use {} guarantees the reader is closed even if action throws
    ReversedLinesFileReader(this, Charset.defaultCharset()).use { reader ->
        var line: String? = reader.readLine()
        while (line != null) {
            action(line)
            line = reader.readLine()
        }
    }
}

and then call it in the following way, which is the same as before except for the call to the forEachLineFromTheEndOfFile function:

val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
    val i = 0 // same placeholder body as before
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())

This took 17 minutes and 50 seconds to run!

  • Am I using ReversedLinesFileReader in the correct way?
  • I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
  • Is it just the case that files should not be read from the end to the start?

1 Answer

You are asking for a very expensive operation. Not only are you reading the file backwards in randomly accessed blocks (so if the file system is reading ahead, it is reading in the wrong direction), you are also reading an XML file encoded in UTF-8, and decoding a variable-width encoding is slower than decoding a fixed-width one.

On top of that, you are using a less than efficient algorithm. It reads one block at a time of an inconvenient size (is it disk-block-size aware? are you setting the block size to match your file system?), working backwards while processing the encoding, and it makes an (unnecessary?) copy of the partial byte array before turning it into a string (do you even need a string to parse?). It could create the string without that copy, and creating the string at all could probably be deferred so that you work directly from the buffer, decoding only when you need to (XML parsers, for example, can also work from ByteArrays or buffers). And there are other array copies that just are not needed but are more convenient for the code.
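
To make the "work from the buffer, defer the String" point concrete, here is a minimal sketch (the names are illustrative, not part of Commons IO): line boundaries are kept as offsets into the raw block, and decoding happens only on demand:

import java.nio.charset.Charset

// Illustrative only: a lazy view into a byte block.
class LineSlice(private val bytes: ByteArray, private val offset: Int, private val length: Int) {
    // Decode to a String only when one is actually required;
    // a parser that accepts byte ranges never pays this cost.
    fun decode(charset: Charset = Charsets.UTF_8): String =
        String(bytes, offset, length, charset)
}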

It also might have a bug: it checks for newline bytes without considering that the same byte might mean something different if it is actually part of a multi-byte sequence. For variable-length encodings it would have to look back a few extra bytes to check this, and I do not see it doing that.
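
For what it is worth, UTF-8 itself happens to be safe here: continuation bytes always match the bit pattern 10xxxxxx, so an ASCII byte like 0x0A can never appear inside a multi-byte sequence. Other variable-width encodings (UTF-16, for example) give no such guarantee. A sketch of the relevant check (the function name is mine):

// In UTF-8, bytes inside a multi-byte sequence look like 10xxxxxx,
// so a raw '\n' (0x0A) byte is always a real newline.
fun isUtf8ContinuationByte(b: Byte): Boolean = (b.toInt() and 0xC0) == 0x80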

So instead of a nice forward-only, heavily buffered, sequential read of the file, which is the fastest thing you can do on your file system, you are doing random reads of one block at a time. It should at least read multiple disk blocks at once so that it can use the forward momentum (setting the block size to some multiple of your disk block size would help) and also reduce the number of "left over" copies made at buffer boundaries.
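
Commons IO does let you choose the block size through another constructor; a hedged sketch (64 KiB is an illustrative guess, not a measured optimum):

import org.apache.commons.io.input.ReversedLinesFileReader
import java.io.File
import java.nio.charset.StandardCharsets

// ReversedLinesFileReader(file, blockSize, charset): a larger block size
// means fewer backward seeks per batch of lines read.
val reader = ReversedLinesFileReader(File("very-large-file.xml"), 64 * 1024, StandardCharsets.UTF_8)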

There are probably faster approaches, but none will be as fast as reading the file in forward order.

UPDATE:

OK, so I tried an experiment with a rather silly version that processes around 27 GB of data by reading the first 10 million lines of the Wikidata JSON dump and then reversing those lines.

Timings on my 2015 MacBook Pro (with all my dev tooling and many Chrome windows open, eating memory and some CPU the whole time; about 5 GB of memory free; default JVM size with no parameters set; not run under a debugger):

reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order:  77,564 ms =  77 secs = 1 min 17 secs

temp file count:   201
approx char count: 29,483,478,770 (line content not including line endings)
total line count:  10,050,000

The algorithm is to read the original file by lines, buffering 50,000 lines at a time and writing each buffer in reverse order to a numbered temp file. After all the files are written, they are read forward by lines, in reverse numerical order, which essentially divides the original into reverse-ordered fragments. It could be optimized, because this is the most naive version of the algorithm with no tuning, but it does what file systems do best: sequential reads and sequential writes with good-sized buffers.

So this is a lot faster than the reader you were using, and it could be tuned from here to be more efficient. You could trade CPU for disk I/O and try using gzipped temp files, maybe with a two-threaded model that gzips the next buffer while processing the previous one. Beyond that: fewer string allocations, checking each file function to make sure nothing extra is going on, making sure there is no double buffering, and more.
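
A hedged sketch of the gzip idea (the helper names and buffer sizes are illustrative; only the temp-file streams change, the rest of the algorithm below stays the same):

import java.io.File
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Write a chunk compressed: more CPU for deflate, less disk I/O.
fun gzipWriter(file: File) =
    GZIPOutputStream(file.outputStream(), 64 * 1024).bufferedWriter()

// Read it back the same way.
fun gzipReader(file: File) =
    GZIPInputStream(file.inputStream(), 64 * 1024).bufferedReader()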

The ugly but functional code is:

package com.stackoverflow.reversefile

import java.io.File
import java.util.*

fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")

    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }

            println("  flushed at $lineCount lines")
        }

        // read the original file and break it into reverse-ordered chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
                .lineSequence()
                .takeWhile { lineCount <= maxLines }.forEach { line ->
                    lineBuffer.add(line)
                    if (lineBuffer.size >= maxBufferSize) flush()
                }
        flush()

        // read the reverse-ordered chunks back, last file first
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                .forEach { line ->
                    approxCharCount += line.length
                    // a line has been read here
                }
            println("  file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed =  System.currentTimeMillis() - startTime

    println("temp file count:   $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count:  $lineCount")
    println()
    println("Elapsed:  ${elapsed}ms  ${elapsed / 1000}secs  ${elapsed / 1000 / 60}min  ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { againLineCount <= maxLines }
            .forEach { againLineCount++ }
    val againElapsed =  System.currentTimeMillis() - againStartTime
    println("Elapsed:  ${againElapsed}ms  ${againElapsed / 1000}secs  ${againElapsed / 1000 / 60}min  ")
}