 

How to process rows of a CSV file using Groovy/GPars most efficiently?

Tags:

groovy

The question is a simple one and I am surprised it did not pop up immediately when I searched for it.

I have a CSV file, a potentially really large one, that needs to be processed. Each line should be handed to a processor until all rows are processed. For reading the CSV file, I'll be using OpenCSV which essentially provides a readNext() method which gives me the next row. If no more rows are available, all processors should terminate.

For this I created a really simple Groovy script, defined a synchronized readNext() method (since reading the next line is not really time consuming) and then created a few threads that each read the next line and process it. It works fine, but...

Shouldn't there be a built-in solution that I could just use? It's not the GPars collection processing, because that always assumes there is an existing collection in memory. I cannot afford to read it all into memory and then process it; that would lead to an OutOfMemoryError.

So.... anyone having a nice template for processing a CSV file "line by line" using a couple of worker threads?
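For reference, a setup like the one described above might look roughly like this. This is only a sketch under assumptions: the OpenCSV package name matches the 2011-era releases (`au.com.bytecode.opencsv`; newer versions use `com.opencsv`), and `process(row)` stands in for whatever per-row work you do.

```groovy
// Hypothetical sketch: a shared reader with a synchronized accessor,
// consumed by a fixed number of plain worker threads.
import au.com.bytecode.opencsv.CSVReader

class CsvSource {
    private final CSVReader reader
    CsvSource(Reader input) { reader = new CSVReader(input) }
    // synchronized so only one thread advances the underlying reader at a time
    synchronized String[] nextRow() { reader.readNext() }
}

def source = new CsvSource(new FileReader('data.csv'))
def workers = (1..4).collect {
    Thread.start {
        String[] row
        while ((row = source.nextRow()) != null) {
            process(row)   // placeholder for the actual per-row processor
        }
    }
}
workers*.join()   // all threads terminate once readNext() returns null
```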

Sven Haiges asked Oct 13 '11 14:10


1 Answer

Concurrently accessing a file might not be a good idea, and GPars' fork/join processing is only meant for in-memory data (collections). My suggestion would be to read the file sequentially into a list. When the list reaches a certain size, process the entries in the list concurrently using GPars, clear the list, and then move on with reading lines.
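A minimal sketch of that batching approach, assuming OpenCSV and GPars are on the classpath; `batchSize` and `process()` are placeholders, and the OpenCSV package name matches the 2011-era releases:

```groovy
// One thread reads rows sequentially; each full batch is processed
// in parallel with GPars, then cleared before reading continues.
import au.com.bytecode.opencsv.CSVReader
import groovyx.gpars.GParsPool

def batchSize = 1000
def reader = new CSVReader(new FileReader('data.csv'))
try {
    GParsPool.withPool {
        def batch = []
        String[] row
        while ((row = reader.readNext()) != null) {
            batch << row
            if (batch.size() == batchSize) {
                batch.eachParallel { process(it) }  // concurrent per-row work
                batch.clear()
            }
        }
        // don't forget the final, partially filled batch
        if (batch) {
            batch.eachParallel { process(it) }
        }
    }
} finally {
    reader.close()
}
```

Wrapping the whole loop in a single `GParsPool.withPool` block reuses one thread pool across all batches instead of creating a new pool per batch; memory use is bounded by `batchSize` rows rather than the whole file.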

Christoph Metzendorf answered Oct 24 '22 01:10