How to copy large data files line by line?

I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.

try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
    try (BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
        br.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) // bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write(line + "\n");
                } catch (IOException e) {
                    // write() throws a checked exception, which a lambda cannot propagate
                    throw new UncheckedIOException(e);
                }
            });
    }
}

This takes approx. 7 minutes. Is it possible to speed up that process even more?

membersound asked Oct 22 '19 09:10



1 Answer

If it is an option, you could keep the files compressed and use GZIPInputStream/GZIPOutputStream (java.util.zip) to reduce disk I/O.
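A minimal sketch of the compressed variant, assuming both the source and the target may be stored as gzip files. The class name GzipLineFilter, the method name filter and the keep predicate are made up for illustration; the predicate stands in for the real condition. Whether this is faster depends on how the CPU cost of (de)compression compares with the disk I/O saved, so measure it on your data.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original question.
final class GzipLineFilter {

    /** Copies the lines accepted by keep from one gzip file to another; returns the number of lines written. */
    static long filter(Path source, Path target, Predicate<String> keep) throws IOException {
        long written = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new GZIPInputStream(Files.newInputStream(source)), StandardCharsets.UTF_8));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (keep.test(line)) {
                    out.write(line);
                    out.write('\n');
                    written++;
                }
            }
        } // closing the writer also finishes the gzip stream
        return written;
    }
}
```

It would be called as GzipLineFilter.filter(Paths.get("source.csv.gz"), Paths.get("target.csv.gz"), line -> !line.trim().isEmpty()), where the file names are placeholders.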

Files.newBufferedReader/Writer use a default buffer size, 8 KB I believe. You might try a larger buffer.
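Files.newBufferedReader/newBufferedWriter take no buffer-size argument, so a larger buffer means constructing BufferedReader/BufferedWriter yourself. A rough sketch follows; the class name BigBufferFilter and the 1 Mi-char buffer size are assumptions for illustration, not tuned values.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

// Hypothetical helper, not part of the original question.
final class BigBufferFilter {

    // Assumed buffer size (in chars); try a few values and measure.
    static final int BUFFER_CHARS = 1 << 20;

    /** Copies the lines accepted by keep using explicitly sized buffers; returns the number of lines written. */
    static long filter(Path source, Path target, Predicate<String> keep) throws IOException {
        long written = 0;
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(source), StandardCharsets.UTF_8),
                     BUFFER_CHARS);
             BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(target), StandardCharsets.UTF_8),
                     BUFFER_CHARS)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (keep.test(line)) {
                    out.write(line);
                    out.write('\n');
                    written++;
                }
            }
        }
        return written;
    }
}
```

The sequential loop also keeps the output in input order, which a parallel stream with forEach does not guarantee.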

Converting to String (Unicode) slows things down too, and uses twice the memory. The UTF-8 used by default is not as simple to decode as StandardCharsets.ISO_8859_1.

Best would be if you can work with bytes for the most part and only for specific CSV fields convert them to String.
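A sketch of what working on bytes could look like. The names ByteLineFilter, copyNonBlank and firstField are hypothetical, the blank check is a simplified stand-in for StringUtils.isNotBlank (it only treats space, tab and '\r' as whitespace), and firstField assumes the field contains no quoted commas. Splitting on the bytes '\n' and ',' is safe for UTF-8 and ISO-8859-1, because those byte values never occur inside a multi-byte UTF-8 sequence.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question.
final class ByteLineFilter {

    /** Copies every non-blank line without decoding it to a String; returns the number of lines written. */
    static long copyNonBlank(InputStream rawIn, OutputStream rawOut) throws IOException {
        long written = 0;
        try (InputStream in = new BufferedInputStream(rawIn, 1 << 16);
             OutputStream out = new BufferedOutputStream(rawOut, 1 << 16)) {
            ByteArrayOutputStream line = new ByteArrayOutputStream(256);
            boolean blank = true;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    if (!blank) {
                        line.writeTo(out);
                        out.write('\n');
                        written++;
                    }
                    line.reset();
                    blank = true;
                } else {
                    line.write(b);
                    if (b != ' ' && b != '\t' && b != '\r') {
                        blank = false;
                    }
                }
            }
            if (!blank) { // last line without a trailing '\n'
                line.writeTo(out);
                out.write('\n');
                written++;
            }
        }
        return written;
    }

    /** Decodes only the first CSV field of a raw line (assumes no quoting). */
    static String firstField(byte[] line) {
        int end = 0;
        while (end < line.length && line[end] != ',') {
            end++;
        }
        return new String(line, 0, end, StandardCharsets.UTF_8);
    }
}
```

If the real condition only looks at one or two fields, only those fields need the String conversion; the rest of the line is copied through as raw bytes.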

A memory-mapped file might be the most appropriate. Parallelism might be used by file ranges, splitting up the file.

try (FileChannel sourceChannel = new RandomAccessFile("source.csv","r").getChannel(); ...
MappedByteBuffer buf = sourceChannel.map(...);

This takes a bit more code, since you have to find the line boundaries yourself at (byte) '\n', but it is not overly complex.
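A single-threaded sketch of that idea, under a few assumptions: the names MappedLineFilter, copyNonBlank and writeRange are made up, the blank check again stands in for the real condition, and no single line is longer than one mapping window. A single mapping cannot exceed Integer.MAX_VALUE bytes, so a 35 GB file has to be mapped window by window; each window is cut at the last '\n' and the next one starts at the incomplete line.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original answer.
final class MappedLineFilter {

    /** Copies non-blank lines using read-only mappings of at most window bytes; returns the number of lines kept. */
    static long copyNonBlank(Path source, Path target, int window) throws IOException {
        long kept = 0;
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long pos = 0;
            while (pos < size) {
                int len = (int) Math.min(window, size - pos);
                boolean lastWindow = pos + len == size;
                MappedByteBuffer buf = in.map(FileChannel.MapMode.READ_ONLY, pos, len);
                int lineStart = 0;   // offset of the current line inside this window
                int runStart = -1;   // start of the current run of consecutive kept lines, or -1
                boolean blank = true;
                for (int i = 0; i < len; i++) {
                    byte b = buf.get(i);
                    if (b == '\n') {
                        if (!blank) {
                            if (runStart < 0) runStart = lineStart;
                            kept++;
                        } else if (runStart >= 0) {
                            writeRange(out, buf, runStart, lineStart); // flush run before a dropped line
                            runStart = -1;
                        }
                        lineStart = i + 1;
                        blank = true;
                    } else if (b != ' ' && b != '\t' && b != '\r') {
                        blank = false;
                    }
                }
                if (runStart >= 0) writeRange(out, buf, runStart, lineStart);
                if (lastWindow) {
                    if (!blank) { // final line without a trailing '\n'
                        writeRange(out, buf, lineStart, len);
                        writeRange(out, ByteBuffer.wrap(new byte[] {'\n'}), 0, 1);
                        kept++;
                    }
                    pos = size;
                } else if (lineStart == 0) {
                    throw new IOException("line longer than mapping window at offset " + pos);
                } else {
                    pos += lineStart; // remap from the start of the incomplete line
                }
            }
        }
        return kept;
    }

    /** Writes the bytes [from, to) of buf to the channel without copying them into an array. */
    private static void writeRange(FileChannel out, ByteBuffer buf, int from, int to) throws IOException {
        ByteBuffer slice = buf.duplicate();
        slice.limit(to);
        slice.position(from);
        while (slice.hasRemaining()) {
            out.write(slice);
        }
    }
}
```

For the parallel variant, each thread could take its own file range, start at the first '\n' after the range start, and write to its own output file, with the parts concatenated afterwards; that part is not shown here.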

Joop Eggen answered Sep 27 '22 18:09