Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is Java List traversal slower than file readline?

I had this piece of code:

while((line=br.readLine())!=null)
        {
            String Words[]= line.split(" ");
            outputLine = SomeAlgorithm(Words);
            output.write(outputLine);
        }

As you can see in the above code, for every line in the input file I'm reading one line, running some algorithm on it which modifies that line read basically, and then writes the output line to some file.

There are 9k lines in the file, and the entire program took 3 minutes on my machine.

I thought, okay, I'm doing 2 I/Os for every (line) run of the algorithm. So I'm doing around 18k I/Os. Why not collect all the lines first into an ArrayList , then loop through the list and run the algorithm on each line? Also collect each output into one string variable, and then write out all the output once at the end of the program.

That way, I'd have total 2 big I/Os for the entire program (18k small File I/Os to 2 big File I/Os). I thought this would be faster, so I wrote this:

List<String> lines = new ArrayList<String>();
while((line=br.readLine())!=null)
        {
            lines.add(line); // collect all lines first
        }

for (String line : lines){
    String Words[] = line.split(" ");
    bigOutput+=SomeAlgorithm(Words); // collect all output
}

output.write(bigOutput);

But, this thing took 7 minutes !!!

So, why is looping through ArrayList slower than reading a file line by line?

Note : Collecting all lines by readLine() and writing the bigOutput are each taking only a few seconds. There is no change made to SomeAlgorithm() either. So, definitely, I think the culprit is for (String line: lines)

Update: As mentioned in the various comments below, the problem was not with ArrayList traversal , it was with the way the output was accumulated using += . Shifting to StringBuilder() did give a faster-than-original result.

like image 987
sanjeev mk Avatar asked Aug 10 '14 18:08

sanjeev mk


2 Answers

I suspect the difference in performance is due to how you are collecting the output in one variable (bigOutput). My conjecture is that this involves lots of memory reallocations and copying of character data, which is the real cause of the slowness.

like image 106
NPE Avatar answered Sep 20 '22 13:09

NPE


This depends on the size of the file, but likely what is going on here is that it takes longer to resize the ArrayList storage and concatenate strings multiple times than it does to make a lot of small file operations.

Bear in mind that the disk and the OS both perform some level of I/O caching, and some of this involved read-ahead (with the expectation being that you are likely to read data sequentially) so the first read is likely stuffing quite a bit of the file into the I/O cache, from which you can read very quickly.

You are therefore trading small reads from the I/O cache for many resizes of flat arrays (the ArrayList and output sting), both of which become slower and slower each time.

tl;dr version: Let the various I/O caches do their job.

like image 22
cdhowie Avatar answered Sep 17 '22 13:09

cdhowie