How to access cached data in Spark Streaming application?

I have a Kafka broker with JSON data from my IoT applications. I connect to this server from a Spark Streaming application in order to do some processing.

I'd like to save some specific fields of my JSON data in memory (RAM), which I believe I could achieve using the cache() and persist() operators.

The next time I receive new JSON data in the Spark Streaming application, I check in memory (RAM) whether there are fields in common that I can retrieve. If so, I do some simple computations and finally update the values of the fields I saved in memory (RAM).

Thus, I would like to know if what I described above is possible. If yes, do I have to use cache() or persist()? And how can I retrieve my fields from memory?

Asked by Yassir S, Nov 18 '16
1 Answer

It's possible with cache / persist, which keep data in memory or on disk in Spark applications (not only in Spark Streaming applications -- it's a general caching mechanism in Spark).
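As a minimal sketch of that general mechanism (the `parseFields` function and the field type are placeholders for your own JSON extraction logic):

```scala
import org.apache.spark.storage.StorageLevel

// jsonDStream: DStream[String] of raw JSON records from Kafka (assumed to exist).
// parseFields is a hypothetical function extracting the fields you care about.
val fields = jsonDStream.map(parseFields)

// Keep the parsed fields in memory across batch computations.
// fields.cache() is shorthand for persisting with the default storage level;
// persist(...) lets you pick the level explicitly.
fields.persist(StorageLevel.MEMORY_ONLY)
```

Note that caching a DStream persists the RDDs it generates; it does not by itself give you a lookup structure you can query by key across batches, which is what the stateful operators below are for.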

But... in Spark Streaming you have special support for exactly such use cases, called stateful computations. See the Spark Streaming Programming Guide to explore what's possible.

I think the mapWithState operator is exactly what you're after for your use case.
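A sketch of how mapWithState could look here, assuming each record has been keyed by some field (e.g. a device id) and carries a numeric reading; the running sum per key stands in for whatever "simple computations" you need:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// pairDStream: DStream[(String, Double)] of (deviceId, reading), assumed to exist.
// mapWithState requires checkpointing to be enabled on the StreamingContext:
//   ssc.checkpoint("/path/to/checkpoint/dir")

// For each key, combine the new value with the state kept in memory,
// update the state, and emit the result downstream.
val mappingFunc = (deviceId: String, reading: Option[Double], state: State[Double]) => {
  val updated = reading.getOrElse(0.0) + state.getOption.getOrElse(0.0)
  state.update(updated)            // the new state is retrieved on the next batch
  (deviceId, updated)              // emitted into the resulting DStream
}

val stateDStream = pairDStream.mapWithState(StateSpec.function(mappingFunc))
```

Spark keeps the per-key state in memory between batches for you, which replaces the manual "save fields to RAM and look them up later" scheme from the question.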

Answered by Jacek Laskowski, Sep 22 '22