I'm trying to solve a (simplified here) problem in Spark Streaming: Let's say I have a log of events made by users, where each event is a tuple (user name, activity, time), e.g.:
("user1", "view", "2015-04-14T21:04Z")
("user1", "click", "2015-04-14T21:05Z")
Now I would like to gather the events by user to do some analysis on them. Let's say the output is some analysis of:
("user1", List(("view", "2015-04-14T21:04Z"), ("click", "2015-04-14T21:05Z")))
The events should be kept for as long as 2 months. During that time there might be around 500 million such events, and millions of unique users, which are the keys here.
My questions are:

1. Is it feasible to call updateStateByKey on a DStream when I have millions of keys stored?
2. Am I right that DStream.window is of no use here, when the window is 2 months long and I would like a slide of a few seconds?

P.S. I found out that updateStateByKey is called on all the keys on every slide, so it would be called millions of times every few seconds. That makes me doubt this design, and I'm rather thinking about alternative solutions instead.
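To make the first question concrete, this is roughly the kind of updateStateByKey usage I have in mind (just a minimal sketch; the StreamingContext setup, checkpoint directory and input source are assumed, and the object and method names are mine):

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    object CollectByUser {
      // an event is (activity, time), e.g. ("view", "2015-04-14T21:04Z")
      type Event = (String, String)

      // requires ssc.checkpoint(...) to be set, since updateStateByKey keeps state across batches
      def collectByUser(events: DStream[(String, Event)]): DStream[(String, List[Event])] =
        events.updateStateByKey[List[Event]] { (newEvents: Seq[Event], state: Option[List[Event]]) =>
          // this update function runs for every key on every slide, even when newEvents is empty,
          // which is exactly what worries me with millions of keys
          Some(state.getOrElse(Nil) ++ newEvents)
        }
    }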
I think it depends on how you will query the data in the future. I had a similar scenario: I just did the transformation with mapPartitions and reduceByKey and stored the data in Cassandra.
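In case it helps, here is a rough sketch of what that looks like, assuming the DataStax spark-cassandra-connector and a Cassandra table logs.events_by_user(user, time, activity) with PRIMARY KEY (user, time); the keyspace, table and class names here are placeholders:

    import com.datastax.spark.connector._
    import org.apache.spark.streaming.dstream.DStream

    object StoreEvents {
      case class Event(user: String, activity: String, time: String)

      def store(events: DStream[Event]): Unit =
        events.foreachRDD { rdd =>
          rdd
            // do the per-record shaping once per partition
            .mapPartitions(_.map(e => ((e.user, e.time), e.activity)))
            // collapse duplicate (user, time) records within the micro-batch
            .reduceByKey((first, _) => first)
            .map { case ((user, time), activity) => (user, time, activity) }
            // the long-lived "state" lives in Cassandra, not in Spark's streaming state
            .saveToCassandra("logs", "events_by_user", SomeColumns("user", "time", "activity"))
        }
    }

Later analysis then reads a user's partition back from Cassandra, so Spark never has to hold two months of state for millions of keys in memory.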