I am working on the Spark (Berkeley) cluster computing system. During my research, I learned about some other in-memory systems like Redis, MemcacheDB, etc. It would be great if someone could give me a comparison between Spark and Redis (and MemcacheDB). In what scenarios does Spark have an advantage over these other in-memory systems?
In-memory computing means using a type of middleware software that allows one to store data in RAM, across a cluster of computers, and process it in parallel. Consider operational datasets that are typically stored in a centralized database and that you can now store in “connected” RAM across multiple computers.
In contrast to the traditional computing paradigm of moving data to a separate database, processing it and then saving it back to the data store, with In-Memory Computing everything can be placed in an in-memory data grid and distributed across a horizontally scalable architecture.
In-memory computing (IMC) stores data in RAM rather than in databases hosted on disks.
A cluster computing system (CCS) addresses the limitations of standard single-processor technology. Its objective is to improve performance and power efficiency for storing and mining large data sets, using parallel programming to read and process massive data sets across multiple disks and CPUs.
They are completely different beasts.
Redis and MemcacheDB are distributed stores. Redis is a pure in-memory system with optional persistence, featuring various data structures. MemcacheDB provides a memcached API on top of Berkeley DB. In both cases, they are most likely to be used by OLTP applications, or possibly for simple real-time analytics (on-the-fly aggregation of data).
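For illustration, here is a minimal sketch of that OLTP-style access pattern, using the redis-py client and assuming a Redis server on localhost:6379 (keys and values are made up):

```python
# Fast point reads/writes plus on-the-fly aggregation with a Redis data structure.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Point read/write keyed by a single identifier (sub-millisecond operations).
r.set("user:42:name", "alice")
print(r.get("user:42:name"))          # b'alice'

# Simple real-time analytics: a sorted set keeps a live hit counter
# without any batch processing.
r.zincrby("page:hits", 1, "/home")
r.zincrby("page:hits", 1, "/about")
r.zincrby("page:hits", 3, "/home")
print(r.zrevrange("page:hits", 0, 4, withscores=True))
```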
Both Redis and MemcacheDB lack mechanisms to efficiently iterate over the stored data in parallel: you cannot easily scan the stored data and apply processing to it, because they are not designed for this. Also, short of manual client-side sharding (sketched below), they cannot be scaled out as a cluster (a Redis Cluster implementation is ongoing, though).
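A minimal sketch of client-side manual sharding, where the application hashes each key to pick one of several independent Redis instances (the hosts below are assumptions for illustration):

```python
import zlib
import redis

# Two independent Redis instances; neither knows about the other.
shards = [
    redis.Redis(host="10.0.0.1", port=6379),
    redis.Redis(host="10.0.0.2", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    # A stable hash of the key decides which instance owns it.
    return shards[zlib.crc32(key.encode()) % len(shards)]

shard_for("user:42").set("user:42:name", "alice")
print(shard_for("user:42").get("user:42:name"))
```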
Spark is a system for expediting large-scale analytics jobs (especially iterative ones) by providing in-memory distributed datasets. With Spark, you can implement efficient iterative map/reduce jobs on a cluster of machines.
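A minimal PySpark sketch of what an in-memory distributed dataset looks like in practice (the input path is an assumption, and `local[*]` stands in for a real cluster URL):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# The dataset is partitioned across the cluster and cached in RAM,
# so repeated passes over it avoid re-reading from disk.
lines = sc.textFile("hdfs:///data/corpus.txt").cache()

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(5))
sc.stop()
```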
Redis and Spark both rely on in-memory data management, but Redis (and memcached) play in the same ballpark as the other OLTP NoSQL stores, while Spark is much closer to a Hadoop map/reduce system.
Redis is good at running numerous fast storage/retrieval operations at high throughput with sub-millisecond latency. Spark shines at implementing large-scale iterative algorithms for machine learning, graph analysis, interactive data mining, etc., on significant volumes of data.
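To make the machine-learning side concrete, here is a minimal sketch using MLlib's KMeans on a tiny made-up dataset; it only illustrates the kind of iterative algorithm Spark targets, not a realistic workload:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "kmeans-sketch")

# Tiny illustrative dataset; in practice this would be a large RDD
# loaded from distributed storage and cached in memory.
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# KMeans iterates over the same cached dataset multiple times.
model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
sc.stop()
```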
Update: additional question about Storm
The question asks how Spark compares to Storm (see the comments below).
Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the processing to the data than to move the data to the processing. Each node stores (or caches) its dataset, and jobs are submitted to the nodes, so the processing moves to the data. It is very similar to Hadoop map/reduce, except that memory storage is used aggressively to avoid I/O, which makes it efficient for iterative algorithms (where the output of one step is the input of the next). Shark is only a query engine built on top of Spark (supporting ad-hoc analytical queries).
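A minimal sketch of that iterative pattern: the dataset is cached once, and each iteration reuses it from memory while only a small model value moves between steps (the tiny linear-regression-style dataset and step size are made up for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# (feature, label) pairs cached across iterations.
points = sc.parallelize([(0.5, 1.0), (1.5, 0.0), (2.5, 1.0), (3.5, 0.0)]).cache()

w = 0.0
for i in range(10):
    # The output of one step (w) feeds the next step; the data itself
    # stays in memory on the workers the whole time.
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum()
    w -= 0.1 * gradient

print("final weight:", w)
sc.stop()
```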
You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine: each node implements a basic process, and data items flow in and out of a network of interconnected nodes (contrary to Spark). With Storm, the data moves to the processing.
Both frameworks are used to parallelize computations over massive amounts of data.
However, Storm is good at dynamically processing numerous small data items as they are generated or collected (such as calculating an aggregation function or analytics in real time over a Twitter stream).
Spark applies to a corpus of existing data (like Hadoop) that has been imported into the Spark cluster; it provides fast scanning capabilities thanks to in-memory management, and it minimizes the overall number of I/Os for iterative algorithms.