I am working on the Spark (Berkeley) cluster computing system. During my research, I learned about some other in-memory systems like Redis, MemcacheDB, etc. It would be great if someone could give me a comparison between Spark and Redis (and MemcacheDB). In what scenarios does Spark have an advantage over these other in-memory systems?
In-memory computing means using a type of middleware software that allows one to store data in RAM, across a cluster of computers, and process it in parallel. Consider operational datasets that are typically stored in a centralized database and that you can now store in “connected” RAM across multiple computers.
In contrast to the traditional computing paradigm of moving data to a separate database, processing it and then saving it back to the data store, with In-Memory Computing everything can be placed in an in-memory data grid and distributed across a horizontally scalable architecture.
In-memory computing (IMC) stores data in RAM rather than in databases hosted on disks.
A cluster computing system (CCS) addresses the limitations of standard single-processor technology. Its objective is to improve performance and power efficiency for storing and mining large data sets, using parallel programming to read and process massive data sets across multiple disks and CPUs.
They are completely different beasts.
Redis and MemcacheDB are distributed stores. Redis is a pure in-memory system with optional persistence, featuring various data structures. MemcacheDB provides a memcached API on top of Berkeley DB. In both cases, they are most likely to be used by OLTP applications, or possibly for simple real-time analytics (on-the-fly aggregation of data).
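For illustration, here is a minimal sketch of that OLTP-style access pattern, using the redis-py client and assuming a Redis server on localhost:6379 (keys and values are made up):

```python
# Fast point reads/writes plus on-the-fly aggregation with a Redis data structure.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Point read/write keyed by a single identifier (sub-millisecond operations).
r.set("user:42:name", "alice")
print(r.get("user:42:name"))          # b'alice'

# Simple real-time analytics: a sorted set keeps a live hit counter
# without any batch processing.
r.zincrby("page:hits", 1, "/home")
r.zincrby("page:hits", 1, "/about")
r.zincrby("page:hits", 3, "/home")
print(r.zrevrange("page:hits", 0, 4, withscores=True))
```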
Both Redis and MemcacheDB lack mechanisms to efficiently iterate over the stored data in parallel: you cannot easily scan the stored data and apply processing to it, because they are not designed for this. Also, short of manual client-side sharding (sketched below), they cannot be scaled out as a cluster (a Redis Cluster implementation is ongoing, though).
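A minimal sketch of client-side manual sharding, where the application hashes each key to pick one of several independent Redis instances (the hosts below are assumptions for illustration):

```python
import zlib
import redis

# Two independent Redis instances; neither knows about the other.
shards = [
    redis.Redis(host="10.0.0.1", port=6379),
    redis.Redis(host="10.0.0.2", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    # A stable hash of the key decides which instance owns it.
    return shards[zlib.crc32(key.encode()) % len(shards)]

shard_for("user:42").set("user:42:name", "alice")
print(shard_for("user:42").get("user:42:name"))
```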
Spark is a system for expediting large-scale analytics jobs (especially iterative ones) by providing in-memory distributed datasets. With Spark, you can implement efficient iterative map/reduce jobs on a cluster of machines.
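A minimal PySpark sketch of what an in-memory distributed dataset looks like in practice (the input path is an assumption, and `local[*]` stands in for a real cluster URL):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# The dataset is partitioned across the cluster and cached in RAM,
# so repeated passes over it avoid re-reading from disk.
lines = sc.textFile("hdfs:///data/corpus.txt").cache()

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(5))
sc.stop()
```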
Redis and Spark both rely on in-memory data management, but Redis (and memcached) play in the same ballpark as the other OLTP NoSQL stores, while Spark is much closer to a Hadoop map/reduce system.
Redis is good at running numerous fast storage/retrieval operations at high throughput with sub-millisecond latency. Spark shines at implementing large-scale iterative algorithms for machine learning, graph analysis, interactive data mining, etc., on significant volumes of data.
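To make the machine-learning side concrete, here is a minimal sketch using MLlib's KMeans on a tiny made-up dataset; it only illustrates the kind of iterative algorithm Spark targets, not a realistic workload:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "kmeans-sketch")

# Tiny illustrative dataset; in practice this would be a large RDD
# loaded from distributed storage and cached in memory.
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# KMeans iterates over the same cached dataset multiple times.
model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
sc.stop()
```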
Update: additional question about Storm
The question asks how Spark compares to Storm (see the comments below).
Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the processing to the data than to move the data to the processing. Each node stores (or caches) its dataset, and jobs are submitted to the nodes, so the processing moves to the data. It is very similar to Hadoop map/reduce, except that memory storage is used aggressively to avoid I/O, which makes it efficient for iterative algorithms (where the output of one step is the input of the next). Shark is only a query engine built on top of Spark (supporting ad-hoc analytical queries).
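A minimal sketch of that iterative pattern: the dataset is cached once, and each iteration reuses it from memory while only a small model value moves between steps (the tiny linear-regression-style dataset and step size are made up for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# (feature, label) pairs cached across iterations.
points = sc.parallelize([(0.5, 1.0), (1.5, 0.0), (2.5, 1.0), (3.5, 0.0)]).cache()

w = 0.0
for i in range(10):
    # The output of one step (w) feeds the next step; the data itself
    # stays in memory on the workers the whole time.
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum()
    w -= 0.1 * gradient

print("final weight:", w)
sc.stop()
```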
You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine: each node implements a basic process, and data items flow in and out of a network of interconnected nodes (contrary to Spark). With Storm, the data moves to the processing.
Both frameworks are used to parallelize computations over massive amounts of data.
However, Storm is good at dynamically processing numerous small data items as they are generated or collected (such as calculating an aggregation function or analytics in real time over a Twitter stream).
Spark applies to a corpus of existing data (like Hadoop) that has been imported into the Spark cluster; it provides fast scanning capabilities thanks to in-memory management, and it minimizes the overall number of I/Os for iterative algorithms.