
How can I process a large file via CSVParser?

I have a large .csv file (about 300 MB), which is read from a remote host, and parsed into a target file, but I don't need to copy all the lines to the target file. While copying, I need to read each line from the source and if it passes some predicate, add the line to the target file.

I suppose that Apache Commons CSV (org.apache.commons.csv) can only parse a whole file at once:

CSVFormat csvFileFormat = CSVFormat.EXCEL.withHeader();
// CSVParser's constructor takes a Reader, not a path string.
CSVParser csvFileParser = new CSVParser(new FileReader("filePath"), csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();

so I can't use BufferedReader. Based on my code, a new CSVParser instance would have to be created for each line, which looks inefficient.

How can I parse a single line (with known header of the table) in the case above?

Alex Orlov asked Aug 20 '15 16:08



1 Answer

No matter what you do, all of the data in the file has to be transferred to your local machine, because your system must parse each line to decide whether it qualifies. Whether you stream the remote file through the parser or copy the whole file over first, every byte crosses the network. You have to bring the data over, then discard what you don't need.

Calling csvFileParser.getRecords() is the wrong approach here: according to the documentation, that method loads every row of the file into memory. To keep memory use low, iterate over the records instead; the documentation implies that the following code holds only one record in memory at a time:

CSVParser csvFileParser = CSVParser.parse(new File("filePath"), StandardCharsets.UTF_8, csvFileFormat);

for (CSVRecord csvRecord : csvFileParser) {
     // Test csvRecord against your predicate; write qualifying rows to the new file and flush as needed.
}
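
For illustration, the loop above could be fleshed out roughly as follows. This is only a sketch under some assumptions: the class name CsvFilterSketch, the filter method, and the Predicate parameter are made up for this example, the source is taken to be a file the parser can open, and parser.getHeaderNames() requires Commons CSV 1.7 or later (withHeader() still works but is deprecated in recent releases).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

// Hypothetical helper, not part of Commons CSV.
final class CsvFilterSketch {

    /**
     * Streams records from source to target, keeping only those that
     * satisfy the predicate. Returns the number of data rows written.
     */
    static long filter(Path source, Path target, Predicate<CSVRecord> keep)
            throws IOException {
        // The first record of the source is consumed as the header row.
        CSVFormat inputFormat = CSVFormat.EXCEL.withHeader();
        long written = 0;
        try (CSVParser parser = CSVParser.parse(
                     source.toFile(), StandardCharsets.UTF_8, inputFormat);
             BufferedWriter out = Files.newBufferedWriter(
                     target, StandardCharsets.UTF_8);
             CSVPrinter printer = new CSVPrinter(out, CSVFormat.EXCEL)) {

            // Copy the header row to the target once.
            printer.printRecord(parser.getHeaderNames());

            // Iterating the parser reads records one at a time,
            // unlike getRecords(), which collects them all in a list.
            for (CSVRecord record : parser) {
                if (keep.test(record)) {
                    printer.printRecord(record);
                    written++;
                }
            }
        }
        return written;
    }
}
```

A call such as CsvFilterSketch.filter(source, target, r -> "keep".equals(r.get("status"))) would copy only the rows whose (hypothetical) status column equals keep. A single parser instance handles the whole file, so there is no need to create one per line.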

Since you explained that "filePath" is not local, the loop above is vulnerable to connectivity failures partway through the file. To rule those out, I recommend copying the entire remote file to local storage, verifying the copy by comparing checksums, parsing the local copy to create your target file, and then deleting the local copy when you are done.
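
The copy-and-verify step might look something like the sketch below, which uses only the JDK. It rests on assumptions not stated in the question: that the remote file is reachable as an InputStream, and that the remote host publishes a SHA-256 checksum to compare against. The class and method names (VerifiedCopySketch, copyAndVerify) are invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
final class VerifiedCopySketch {

    /**
     * Copies the remote stream to a local temp file while hashing it.
     * Returns the local path if the SHA-256 matches the expected hex
     * string (an assumed, externally supplied value); otherwise deletes
     * the copy and throws.
     */
    static Path copyAndVerify(InputStream remote, String expectedSha256Hex)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        Path local = Files.createTempFile("remote-copy-", ".csv");
        try (DigestInputStream in = new DigestInputStream(remote, sha256)) {
            // The digest is updated as the bytes are copied.
            Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(local); // do not leave a partial copy behind
            throw e;
        }
        String actual = toHex(sha256.digest());
        if (!actual.equalsIgnoreCase(expectedSha256Hex)) {
            Files.deleteIfExists(local);
            throw new IOException("Checksum mismatch: expected "
                    + expectedSha256Hex + " but got " + actual);
        }
        return local;
    }

    /** Formats bytes as lowercase hexadecimal. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The returned path can then be handed to CSVParser.parse(...) as in the loop above, with Files.deleteIfExists(local) in a finally block once the target file has been written.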

JoshDM answered Oct 14 '22 04:10
