How to deal with a huge, one-line file in Java

I need to read a huge file (15+GB) and perform some minor modifications (add some newlines so a different parser can actually work with it). You might think that there are already answers for doing this normally:

  • Reading a very huge file in java
  • How to read a large text file line by line using Java?

but my entire file is on one line.

My general approach so far is very basic:

char[] buffer = new char[X];
BufferedReader reader = new BufferedReader(new ReaderUTF8(new FileInputStream(new File("myFileName"))), X);
char[] bufferOut = new char[2 * X]; // worst case: every input char is '}' and gains a newline
int bytesRead = -1;
int i = 0;
int offset = 0;
long totalBytesRead = 0;
int countToPrint = 0;
while((bytesRead = reader.read(buffer)) >= 0){
    for(i = 0; i < bytesRead; i++){
        if(buffer[i] == '}'){
            bufferOut[i+offset] = '}';
            offset++;
            bufferOut[i+offset] = '\n';
        }
        else{
            bufferOut[i+offset] = buffer[i];
        }
    }
    writer.write(bufferOut, 0, bytesRead+offset);
    offset = 0;
    totalBytesRead += bytesRead;
    countToPrint += 1;
    if(countToPrint == 10){
        countToPrint = 0;
        System.out.println("Read "+((double)totalBytesRead / originalFileSize * 100)+" percent.");
    }
}
writer.flush();

After some experimentation, I've found that a value of X larger than a million gives optimal speed - it looks like I'm getting about 2% every 10 minutes, while a value of X of ~60,000 only got 60% in 15 hours. Profiling reveals that I'm spending 96+% of my time in the read() method, so that's definitely my bottleneck. As of writing this, my 8 million X version has finished 32% of the file after 2 hours and 40 minutes, in case you want to know how it performs long-term.

Is there a better approach for dealing with such a large, one-line file? As in, is there a faster way of reading this type of file that gives me a relatively easy way of inserting the newline characters?

I am aware that different languages or programs could probably handle this gracefully, but I'm limiting this to a Java perspective.

Jeutnarg asked Jun 03 '16 18:06


1 Answer

You are making this far more complicated than it needs to be. Just by using the buffering already provided by the standard classes, you should get a throughput of at least several MB per second without any hassle.

This simple test program processes 1GB in less than 2 minutes on my PC (including creating the test file):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class TestFileProcessing {

    public static void main(String[] argv) {

        try {
            long time = System.currentTimeMillis();
            File from = new File("C:\\Test\\Input.txt");
            createTestFile(from, StandardCharsets.UTF_8, 1_000_000_000);
            System.out.println("Created file in: " + (System.currentTimeMillis() - time) + "ms");

            time = System.currentTimeMillis();
            File to = new File("C:\\Test\\Output.txt");
            doIt(from, to, StandardCharsets.UTF_8);
            System.out.println("Converted file in: " + (System.currentTimeMillis() - time) + "ms");
        } catch (IOException e) {
            throw new RuntimeException(e.getMessage(), e);
        }
    }

    public static void createTestFile(File file, Charset encoding, long size) throws IOException {
        Random r = new Random(12345);
        try (OutputStream fout = new FileOutputStream(file);
                BufferedOutputStream bout = new BufferedOutputStream(fout);
                Writer writer = new OutputStreamWriter(bout, encoding)) {
            for (long i=0; i<size; ++i) {
                int c = r.nextInt(26);
                if (c == 0)
                    writer.write('}');
                else
                    writer.write('a' + c);
            }
        }
    }

    public static void doIt(File from, File to, Charset encoding) throws IOException {
        try (InputStream fin = new FileInputStream(from);
                BufferedInputStream bin = new BufferedInputStream(fin);
                Reader reader = new InputStreamReader(bin, encoding);
                OutputStream fout = new FileOutputStream(to);
                BufferedOutputStream bout = new BufferedOutputStream(fout);
                Writer writer = new OutputStreamWriter(bout, encoding)) {
            int c;
            while ((c = reader.read()) >= 0) {
                writer.write(c);
                // Insert the newline after each '}', as the question's code does
                if (c == '}')
                    writer.write('\n');
            }
        }
    }

}

As you can see, no elaborate logic or oversized buffers are needed. All it takes is buffering the streams closest to the hardware, the FileInputStream and FileOutputStream.
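
For comparison, the same conversion can also be written with block reads instead of one read() call per character. The following is only a minimal sketch under stated assumptions, not a benchmarked replacement for the program above: the class name ChunkedNewlineInserter, the method insertNewlines and the 8192-char chunk size are made up for illustration, and the file paths are taken from the command line. It puts the newline after each '}', as the code in the question does.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class and method names, chosen for this sketch only.
class ChunkedNewlineInserter {

    // Copies all characters from 'in' to 'out' and writes '\n' after every '}'.
    // chunkSize is an arbitrary tuning knob and must be positive.
    static void insertNewlines(Reader in, Writer out, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] chunk = new char[chunkSize];
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) >= 0) {
            int start = 0; // index of the first char in this chunk not yet written
            for (int i = 0; i < n; i++) {
                if (chunk[i] == '}') {
                    // Write the pending run up to and including the '}', then the newline
                    out.write(chunk, start, i - start + 1);
                    out.write('\n');
                    start = i + 1;
                }
            }
            // Write whatever follows the last '}' in this chunk
            out.write(chunk, start, n - start);
        }
    }

    public static void main(String[] args) throws IOException {
        Path from = Paths.get(args[0]); // input path (assumed to be passed by the caller)
        Path to = Paths.get(args[1]);   // output path
        Charset encoding = StandardCharsets.UTF_8; // assumption: the file is UTF-8
        try (BufferedReader reader = Files.newBufferedReader(from, encoding);
             BufferedWriter writer = Files.newBufferedWriter(to, encoding)) {
            insertNewlines(reader, writer, 8192);
        }
    }
}
```

It would be run as `java ChunkedNewlineInserter Input.txt Output.txt`. Whether this is noticeably faster than the per-character loop would have to be measured on the actual file. One behavioral difference to keep in mind: Files.newBufferedReader throws an exception on malformed input, whereas an InputStreamReader constructed with a Charset substitutes a replacement character.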

Durandal answered Sep 27 '22 18:09