 

Process a file line by line concurrently

I am working on a data format transformation job. There is a large file, around 10 GB. The current solution I implemented reads the file line by line, transforms the format of each line, then writes it to an output file. I found that the transform step is the bottleneck, so I am trying to do it in a concurrent way.

Each line is a complete unit and has nothing to do with other lines. Some lines may be discarded because specific values in them do not meet the requirements.

Now I have two plans:

  1. One thread reads the input file line by line and puts each line into a queue; several worker threads take lines from the queue, transform the format, and put the results into an output queue; finally, an output thread takes lines from the output queue and writes them to the output file (see the sketch after this list).

  2. Several threads concurrently read data from different parts of the input file, process the lines, and write them to one output file through an output queue or a file lock.
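
For concreteness, here is a minimal sketch of what I mean by plan 1, assuming Java; the file names, queue sizes, and the transformLine logic are placeholders, not my real code:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    static final String POISON = new String("POISON"); // sentinel, compared by identity
    static final int WORKERS = 4;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> inQueue = new ArrayBlockingQueue<>(1000);
        BlockingQueue<String> outQueue = new ArrayBlockingQueue<>(1000);

        // One reader thread feeds lines into the input queue.
        Thread reader = new Thread(() -> {
            try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"))) {
                String line;
                while ((line = br.readLine()) != null) inQueue.put(line);
                for (int i = 0; i < WORKERS; i++) inQueue.put(POISON); // one pill per worker
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        // Several workers transform lines; a null result means the line is discarded.
        Thread[] workers = new Thread[WORKERS];
        for (int i = 0; i < WORKERS; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String line;
                    while ((line = inQueue.take()) != POISON) {
                        String out = transformLine(line);
                        if (out != null) outQueue.put(out);
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }

        // One writer thread drains the output queue to the output file.
        Thread writer = new Thread(() -> {
            try (BufferedWriter bw = Files.newBufferedWriter(Paths.get("output.txt"))) {
                String line;
                while ((line = outQueue.take()) != POISON) { bw.write(line); bw.newLine(); }
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        reader.start();
        for (Thread w : workers) w.start();
        writer.start();
        reader.join();
        for (Thread w : workers) w.join();
        outQueue.put(POISON); // all workers finished: stop the writer
        writer.join();
    }

    // Placeholder: return the converted line, or null to drop it.
    static String transformLine(String line) {
        return line.isEmpty() ? null : line.toUpperCase();
    }
}
```

Note that with this structure the output order depends on worker scheduling, not on input order.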

Would you please give me some advice? I really appreciate it.

Thanks in advance!

asked Dec 19 '12 by BigPotato

2 Answers

I would go for the first option. Reading data from a file in small pieces from several positions is normally slower than reading the whole file sequentially in one pass (depending on file caches, buffering, read-ahead, etc.), so a single reader thread keeps the I/O pattern sequential.

You also might need to think about how to assemble the output file: collecting the lines from the different worker threads, possibly in the correct order if that is required.

answered Oct 06 '22 by Michel Keijzers


Solution 1 makes sense.

This would also map nicely and simply to Java's Executor framework. Your main thread reads lines and submits each line to an Executor or ExecutorService.
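
A minimal sketch of that idea, assuming a fixed thread pool and a placeholder transform (not the actual format conversion); writes are serialized through a synchronized helper, so the output order is not preserved:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"));
             BufferedWriter bw = Files.newBufferedWriter(Paths.get("output.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                final String l = line; // a lambda may only capture effectively final locals
                pool.submit(() -> {
                    String out = transform(l);
                    if (out != null) write(bw, out); // null = line discarded
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS); // finish all tasks before bw closes
        }
    }

    static String transform(String line) { return line.trim(); } // placeholder

    // Serialize writes from many workers; lines come out in completion order.
    static synchronized void write(BufferedWriter bw, String s) {
        try {
            bw.write(s);
            bw.newLine();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

One caveat: Executors.newFixedThreadPool uses an unbounded task queue, so on a 10 GB file the reader can run far ahead of the workers and buffer too much in memory; in practice you would throttle submission, for example with a bounded queue as in the sketch below.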

It gets more complicated if you must keep order intact, though.
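
One common way to handle that (a sketch under the same placeholder assumptions, not the only option) is to queue the Futures in submission order and let a single writer consume them, so lines are written exactly in read order; the bounded queue also throttles the reader:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedOutputSketch {
    static final String EOF = new String("EOF"); // end-of-input sentinel, compared by identity

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Futures queue up in submission (= read) order; the bound throttles the reader.
        BlockingQueue<Future<String>> pending = new ArrayBlockingQueue<>(1000);

        Thread writer = new Thread(() -> {
            try (BufferedWriter bw = Files.newBufferedWriter(Paths.get("output.txt"))) {
                while (true) {
                    String out = pending.take().get(); // blocks until this line is transformed
                    if (out == EOF) break;             // no more input
                    if (out != null) {                 // null = line discarded
                        bw.write(out);
                        bw.newLine();
                    }
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        writer.start();

        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                final String l = line;
                pending.put(pool.submit(() -> transform(l)));
            }
        }
        pending.put(CompletableFuture.completedFuture(EOF)); // end-of-input marker
        writer.join();
        pool.shutdown();
    }

    // Placeholder: return the converted line, or null to drop it.
    static String transform(String line) { return line.isEmpty() ? null : line.trim(); }
}
```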

answered Oct 06 '22 by HansMari