Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

I'm trying to solve a problem (simplified here) in Spark Streaming: Let's say I have a log of events made by users, where each event is a tuple (user name, activity, time), e.g.:

("user1", "view", "2015-04-14T21:04Z")
("user1", "click", "2015-04-14T21:05Z")

Now I would like to group the events by user to do some analysis on them. Let's say the output is some analysis of:

("user1", List(("view", "2015-04-14T21:04Z"),("click", "2015-04-14T21:05Z"))

The events should be kept for as long as 2 months. During that time there might be around 500 million such events, and millions of unique users, which are the keys here.
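For a rough sense of scale (back-of-the-envelope only; the ~100 bytes per serialized event is just a guess):

val events        = 500000000L   // ~500 million events over 2 months
val bytesPerEvent = 100L         // assumed average serialized event size
val totalGiB      = events * bytesPerEvent / math.pow(1024, 3)
// ≈ 46.6 GiB of raw state, before replication and JVM overhead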

My questions are:

  • Is it feasible to do such a thing with updateStateByKey on DStream when I have millions of keys stored? (A sketch of what I mean is below.)
  • Am I right that DStream.window is of no use here, when the window is 2 months long and I would like a slide of a few seconds?
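
For reference, here is roughly what I mean by the updateStateByKey approach (a minimal sketch; the socket source, batch interval, and checkpoint directory are placeholders, not my real setup):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UserEventState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("user-event-state")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey requires a checkpoint directory for the state RDDs
    ssc.checkpoint("hdfs:///tmp/checkpoints")

    // Placeholder source; the real job would read from Kafka or similar.
    val events = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(user, activity, time) = line.split(",")
      (user, (activity, time))
    }

    // The update function runs for EVERY key on EVERY batch,
    // even for keys that received no new events in that batch.
    val state = events.updateStateByKey[List[(String, String)]] {
      (newEvents: Seq[(String, String)], current: Option[List[(String, String)]]) =>
        Some(current.getOrElse(Nil) ++ newEvents)
    }

    state.print()
    ssc.start()
    ssc.awaitTermination()
  }
}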

P.S. I found out that updateStateByKey is called on all the keys on every slide, which means it will be called millions of times every few seconds. That makes me doubt this design, and I'm rather thinking about alternative solutions like:

  • using Cassandra for state
  • using Trident state (with Cassandra probably)
  • using Samza with its state management.
asked Apr 14 '15 by zarzyk


1 Answer

I think it depends on how you will query the data later. I had a similar scenario: I just did the transformation through mapPartitions and reduceByKey and stored the data in Cassandra.
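
A rough sketch of that job shape (not my exact code; the keyspace, table, and socket source are placeholders, and it assumes the DataStax spark-cassandra-connector):

import com.datastax.spark.connector._                 // SomeColumns
import com.datastax.spark.connector.streaming._       // saveToCassandra on DStreams
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EventsToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("events-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source; replace with whatever actually feeds the job.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Parse each partition's lines into ((user, activity), count) pairs...
    val pairs = lines.mapPartitions { iter =>
      iter.map { line =>
        val Array(user, activity, _) = line.split(",")
        ((user, activity), 1L)
      }
    }

    // ...aggregate within the batch, then append the batch's counts to Cassandra.
    // The long-lived per-user history lives in the Cassandra table,
    // not in Spark state, so the streaming job itself stays stateless.
    pairs
      .reduceByKey(_ + _)
      .map { case ((user, activity), count) => (user, activity, count) }
      .saveToCassandra("analytics", "user_events", SomeColumns("user", "activity", "count"))

    ssc.start()
    ssc.awaitTermination()
  }
}

Queries over the 2-month history then run against Cassandra (or a batch Spark job reading from it), rather than against state held inside the streaming job.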

answered Oct 23 '22 by Grant