Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best strategy for processing large CSV files in Apache Camel

Tags:

apache-camel

I'd like to develop a route that polls a directory containing CSV files, and for every file it unmarshals each row using Bindy and queues it in activemq.

The problem is files can be pretty large (a million rows) so I'd prefer to queue one row at a time, but what I'm getting is all the rows in a java.util.ArrayList at the end of Bindy which causes memory problems.

So far I have a little test and unmarshaling is working so Bindy configuration using annotations is ok.

Here is the route:

from("file://data/inbox?noop=true&maxMessagesPerPoll=1&delay=5000")
  .unmarshal()
  .bindy(BindyType.Csv, "com.ess.myapp.core")           
  .to("jms:rawTraffic");

Environment is: Eclipse Indigo, Maven 3.0.3, Camel 2.8.0

Thank you

like image 740
Taka Avatar asked Nov 14 '11 14:11

Taka


People also ask

How do I process a large csv file?

So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.

Is Apache Camel still relevant?

Many open source projects and closed source technologies did not withstand the tests of time and have disappeared from the middleware stacks for good. After a decade, however, Apache Camel is still here and becoming even stronger for the next decade of integration.

How does Python handle large CSV files?

read_csv(chunksize) One way to process large files is to read the entries in chunks of reasonable size, which are read into the memory and are processed before reading the next chunk. We can use the chunk size parameter to specify the size of the chunk, which is the number of lines.

What is Camel File?

Apache Camel is messaging technology glue with routing. It joins together messaging start and end points allowing the transference of messages from different sources to different destinations. For example: JMS->JSON, HTTP->JMS or funneling FTP->JMS, HTTP->JMS, JMS=>JSON.


2 Answers

If you use the Splitter EIP then you can use streaming mode which means Camel will process the file on a row by row basis.

from("file://data/inbox?noop=true&maxMessagesPerPoll=1&delay=5000")
  .split(body().tokenize("\n")).streaming()
    .unmarshal().bindy(BindyType.Csv, "com.ess.myapp.core")           
    .to("jms:rawTraffic");
like image 164
Claus Ibsen Avatar answered Sep 29 '22 02:09

Claus Ibsen


For the record and for other users which might have searched for this as much as me, meanwhile there seems to be an easier method which also works well with useMaps:

CsvDataFormat csv = new CsvDataFormat()
    .setLazyLoad(true)
    .setUseMaps(true);

from("file://data/inbox?noop=true&maxMessagesPerPoll=1&delay=5000")
    .unmarshal(csv)
    .split(body()).streaming()
    .to("log:mappedRow?multiline=true");
like image 36
C12Z Avatar answered Sep 29 '22 03:09

C12Z