OutOfMemoryError reading a 174 MB text file with large rows

I have a CSV file with 12000 rows. Each row has several fields enclosed in double quotes and separated by commas. One of these fields is an XML document, so a row can be very long. The file size is 174 MB.

Here is an example of the file:

"100000","field1","field30","<root><data>Hello I have a
line break</data></root>","field31"
"100001","field1","field30","<root><data>Hello I have multiple
line 
break</data></root>","field31"

The problem is that the XML field can contain one or more line breaks, which breaks line-based parsing. The goal is to read the whole file and apply a regex that replaces every line break inside double quotes with an empty string.
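
For illustration, the replacement I have in mind looks roughly like this (a sketch that assumes fields never contain escaped double quotes):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Collapse line breaks inside quoted fields; assumes fields never
    // contain escaped double quotes.
    static String stripBreaksInsideQuotes(String content) {
        Pattern quoted = Pattern.compile("\"[^\"]*\"");
        Matcher m = quoted.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String collapsed = m.group().replaceAll("\\R", "");
            m.appendReplacement(out, Matcher.quoteReplacement(collapsed));
        }
        m.appendTail(out);
        return out.toString();
    }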

The following code throws an OutOfMemoryError:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    String path = "path/to/file.csv";

    try {
        // Load the entire 174 MB file into memory at once
        byte[] content = Files.readAllBytes(Paths.get(path));
    }
    catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }

I've also tried reading the file with a BufferedReader and a StringBuilder; I get an OutOfMemoryError around line 5000:

    import java.io.BufferedReader;
    import java.io.FileReader;

    String path = "path/to/file.csv";

    try {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new FileReader(path));
        String line;
        int count = 0;
        while ((line = br.readLine()) != null) {
            sb.append(line); // every line accumulates in one StringBuilder
            System.out.println("Read " + count++);
        }
    }
    catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }

I've run both programs above with different Java heap sizes, such as -Xmx1024m, -Xmx4096m and -Xmx8092m. In all cases I got an OutOfMemoryError. Why is this happening, given that the file is only 174 MB?

asked Mar 27 '19 by revy
1 Answer

You need two layers of buffering to parse this data structure, and you should process it record by record. Reading the whole document into memory is not the best idea.

Create your own reader that wraps a BufferedReader over your CSV file. After reading a physical line, determine whether you need to read more lines to complete one logical CSV record: for example, if you know that the XML field starts with <root> and ends with </root>, check for these tokens, and keep reading and appending until you reach the closing one. That final line completes your CSV record, as in the sketch below.
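
A minimal sketch of that first layer (the class name is illustrative, and it assumes the <root> and </root> tokens are never split across physical lines):

    import java.io.BufferedReader;
    import java.io.IOException;

    // First layer: assembles physical lines into one logical CSV record.
    class CsvRecordReader {
        private final BufferedReader in;

        CsvRecordReader(BufferedReader in) {
            this.in = in;
        }

        // Returns the next logical record, or null at end of file.
        String readRecord() throws IOException {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            StringBuilder record = new StringBuilder(line);
            // Keep appending physical lines until the open XML field is closed.
            // Joining without a separator also removes the unwanted line breaks.
            while (record.indexOf("<root>") >= 0 && record.indexOf("</root>") < 0) {
                String more = in.readLine();
                if (more == null) {
                    break; // truncated record at end of file
                }
                record.append(more);
            }
            return record.toString();
        }
    }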

The second layer is your CSV processing, based on the record you get from the first step. Parse it, process it, then discard it. Because you only ever hold one record at a time, the Java garbage collector can free the memory as you go.
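
The two layers together might look like this; the file paths are placeholders, and writing each record as a single line also takes care of the line-break removal from the question:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class StreamingFix {
        public static void main(String[] args) throws IOException {
            // Only one record is held in memory at any time.
            try (BufferedReader br = new BufferedReader(new FileReader("path/to/file.csv"));
                 PrintWriter out = new PrintWriter("path/to/fixed.csv")) {
                CsvRecordReader records = new CsvRecordReader(br);
                String record;
                while ((record = records.readRecord()) != null) {
                    out.println(record); // each record is now a single physical line
                }
            }
        }
    }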

This is the only practical way to deal with large files. It is also called the "streaming model", because you pass only small chunks of data through at a time, so the actual memory consumption stays low.

answered Oct 11 '22 by gaborsch