Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.

I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?

Meredith L. Patterson asked Aug 02 '09 00:08



3 Answers

The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system. For one thing, if there is no end to the data, the reduce phase can never run, because reducers cannot begin reducing until every map task has finished.
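To make the batch constraint concrete, here is a minimal sketch in plain Python (not Hadoop code) of the usual workaround: cut the endless stream into finite batches and run a complete map and reduce pass over each one. The function names `micro_batches`, `map_phase`, `reduce_phase`, and `count_per_batch` are made up for illustration, and word counting stands in for whatever analysis you actually need.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a possibly endless iterator into finite lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream ended (never happens for a truly endless source)
        yield batch


def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word, 1


def reduce_phase(pairs):
    """Reduce step: sum the values per key.

    This has to consume every pair before it can return a result,
    which is why the input to one job must be finite.
    """
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_per_batch(stream, batch_size):
    """Run one complete map + reduce pass per finite batch of the stream."""
    for batch in micro_batches(stream, batch_size):
        yield reduce_phase(map_phase(batch))
```

Each yielded dictionary is the result of one "job". Combining results across batches (for example, summing the per-batch counts) is a separate step that this sketch leaves out.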

Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
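A rough sketch of such a watchdog, in Python, might look like the following. It is a single-process version with no ZooKeeper coordination, and it makes several assumptions that are not part of any Hadoop API: rotated logs land in a local directory, `job_command` is a hypothetical command of your own that copies the given files into HDFS and starts the job, and the 64 MB threshold is an arbitrary placeholder for "large enough to warrant the overhead".

```python
import os
import subprocess
import time

# Assumed threshold; tune it so each job's input justifies the startup cost.
MIN_BATCH_BYTES = 64 * 1024 * 1024


def pick_batch(pending, min_bytes):
    """Given (path, size) pairs, return all paths once their total size
    reaches min_bytes; otherwise return an empty list and keep waiting."""
    total = sum(size for _, size in pending)
    if total < min_bytes:
        return []
    return [path for path, _ in pending]


def list_pending(landing_dir):
    """List the regular files in the landing directory with their sizes."""
    with os.scandir(landing_dir) as entries:
        return sorted((e.path, e.stat().st_size) for e in entries if e.is_file())


def watch(landing_dir, done_dir, job_command, poll_seconds=60,
          min_bytes=MIN_BATCH_BYTES):
    """Poll the landing directory forever and launch a job per batch.

    job_command is a hypothetical argument list (e.g. a wrapper script of
    your own); the batch's file paths are appended to it as arguments.
    """
    while True:
        batch = pick_batch(list_pending(landing_dir), min_bytes)
        if batch:
            # check=True raises if the job fails, so failed inputs are not moved.
            subprocess.run(list(job_command) + batch, check=True)
            for path in batch:
                # Move processed files aside so they are not picked up again.
                os.replace(path, os.path.join(done_dir, os.path.basename(path)))
        time.sleep(poll_seconds)
```

The size check lives in `pick_batch` so it can be tested without touching the filesystem; a real deployment would also have to deal with files that are still being written when the watchdog sees them.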

HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.

SquareCog answered Oct 19 '22 19:10



What about S4 (http://s4.io/)? It's made for processing streaming data.

Update

A newer option is Storm, a distributed and fault-tolerant realtime computation system: stream processing, continuous computation, distributed RPC, and more.

DrDol answered Oct 19 '22 17:10



I think you should take a look at Esper CEP (http://esper.codehaus.org/).

Andrei Savu answered Oct 19 '22 19:10
