
Storm spout - How to read all lines from a text file only once, using >1 threads?

Tags:

apache-storm

A storm topology contains a Spout component which is run using >1 threads. e.g.

 builder.setSpout("lines", new TestLineSpout(), 2);

The spout's open function opens the text file and reads all its lines, and nextTuple emits each line to a bolt.

As 2 threads are run for the spout, each line of the file is read twice.

I am new to Storm and am wondering about the best way of handling this. I could reduce the number of threads to 1, or modify the spout so that each thread reads different lines - or do I need to make use of the TopologyContext parameter (and if so, how)? I'm not sure if I've missed a "Storm" way of implementing this.
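For reference, one common way to make each spout task read a disjoint subset of lines is to partition by the task index and total task count (in Storm these are available from TopologyContext via getThisTaskIndex and getComponentTasks). The helper below is a hypothetical, Storm-free sketch of just the partitioning idea, not an actual spout:

```java
import java.util.ArrayList;
import java.util.List;

public class LinePartitioner {
    // Each spout task keeps only the lines whose index maps to its task,
    // so all tasks together cover every line exactly once.
    static List<String> linesForTask(List<String> allLines, int taskIndex, int numTasks) {
        List<String> mine = new ArrayList<>();
        for (int i = taskIndex; i < allLines.size(); i += numTasks) {
            mine.add(allLines.get(i));
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a", "b", "c", "d", "e");
        // With 2 spout tasks, task 0 takes lines 0, 2, 4 and task 1 takes 1, 3.
        System.out.println(linesForTask(lines, 0, 2)); // [a, c, e]
        System.out.println(linesForTask(lines, 1, 2)); // [b, d]
    }
}
```

Each task would call this in open, using its own task index, and then emit only its share of the lines from nextTuple.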

Helen Reeves asked May 28 '14 01:05

1 Answer

Simon,

Storm has no functionality to read files stored on the local file system in parallel. You can write a spout that does that, but apart from small tests and experimentation, it would conflict with the architecture of Storm.

Here are a few pointers:

  • Storm is designed to process data streams received in real time. If you already have all your data finalized and stored somewhere, the constraints imposed by Storm will just be annoyances in your way. Batch-oriented solutions like YARN MapReduce or Spark are easier.

  • Storm is meant to be distributed, with many threads per worker (VM), many workers per slave node and many (many) slave nodes. There is no concept of "a single file on the local file system" in such a distributed architecture. Also, for scalability reasons, one core idea is to have all those workers act independently without communicating with each other. That's why we typically use distributed solutions to feed data into Storm, like Kafka or 0mq.

  • The closest thing to a file on the local file system I can think of in the distributed world is an HDFS folder. The pattern is to have all producers of data write to a folder, each to a file with a uniquely generated name, while data readers consuming the folder read all the files in it, no matter their name. But again, if you go that way, traditional MapReduce or Spark are easier, I think.
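That producer/consumer folder pattern can be sketched with the local file system standing in for HDFS (a real HDFS client would use the Hadoop FileSystem API instead; the class and file names here are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class FolderConsumer {
    // The reader consumes every regular file in the folder, whatever its
    // name -- it does not care which producer wrote which file.
    static List<String> readAll(Path folder) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    lines.addAll(Files.readAllLines(file));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Producers: each writes to its own uniquely named file.
        Path folder = Files.createTempDirectory("producers");
        Files.write(folder.resolve("producer-" + UUID.randomUUID() + ".txt"), List.of("x", "y"));
        Files.write(folder.resolve("producer-" + UUID.randomUUID() + ".txt"), List.of("z"));

        List<String> all = readAll(folder);
        Collections.sort(all); // directory iteration order is unspecified
        System.out.println(all); // [x, y, z]
    }
}
```

Because each producer writes under a unique name, no coordination is needed between writers, and the reader sees every line exactly once.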

I hope this helps :D

Svend answered Oct 23 '22 08:10