
Cassandra + Solr/Hadoop/Spark - Choosing the right tools

I'm currently investigating how to store and analyze enriched time-based data with up to 1000 columns per row. At the moment, Cassandra together with either Solr, Hadoop, or Spark, as offered by DataStax Enterprise, seems to roughly fulfill my requirements. But the devil is in the details.

Out of the 1000 columns, about 60 are used for real-time-like queries (web frontend, the user submits a form and expects a quick response). These queries are more or less GROUP BY statements in which the number of occurrences is counted.
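To make the query shape concrete, here is a minimal sketch in Spark/Scala terms; the Event fields are hypothetical placeholders for two of those ~60 columns:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record holding two of the ~60 query columns.
case class Event(country: String, device: String)

val sc = new SparkContext(new SparkConf().setAppName("query-shape").setMaster("local[*]"))
val rows = sc.parallelize(Seq(
  Event("DE", "mobile"), Event("DE", "mobile"), Event("US", "desktop")))

// Equivalent of: SELECT country, device, COUNT(*) ... GROUP BY country, device
val counts = rows.map(e => ((e.country, e.device), 1L)).reduceByKey(_ + _)
counts.collect().foreach(println)  // e.g. ((DE,mobile),2), ((US,desktop),1)
```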

As Cassandra itself does not provide the required analytical capabilities (no GROUP BY), I'm left with these alternatives:

  • Query roughly via Cassandra and filter the result set in self-written code
  • Index the data with Solr and run facet.pivot queries
  • Use either Hadoop or Spark and run the queries

The first approach seems cumbersome and error-prone… Solr does have some analytic features, but without multi-field grouping I'm stuck with pivots. I don't know whether this is a good or performant approach though… Last but not least, there are Hadoop and Spark: the former known not to be the best fit for real-time queries, the latter pretty new and maybe not production-ready.
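For the Solr route, a facet.pivot request would look roughly like this SolrJ sketch (the collection name, field names, and client setup are assumptions, not something tested against DSE):

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

// Hypothetical Solr collection and field names.
val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/events").build()

val q = new SolrQuery("*:*")
q.setRows(0)                                   // only the counts are needed, not the documents
q.set("facet", true)
q.set("facet.pivot", "country,device,status")  // nested, GROUP BY-like occurrence counts

val pivots = solr.query(q).getFacetPivot       // tree of PivotField counts per field combination
solr.close()
```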

So which way to go? There is no one-size-fits-all here, but before I commit to one path I'd like to get some feedback. Maybe I'm overthinking this or my expectations are too high :S

Thanks in advance,

Arman

asked Mar 30 '14 by Arman



2 Answers

Where I work now, we have a similar set of tech requirements, and the solution is Cassandra-Solr-Spark, in exactly that order.

So if a query can be "covered" by Cassandra indices, good; if not, it's covered by Solr. For testing and less frequent queries: Spark (Scala, no SparkSQL due to the old version of it -- it's a bank, everything should be tested and mature, from cognac to software, argh).

Generally I agree with the solution, though sometimes I have a feeling that some clients' requests should NOT be taken seriously at all, which would save us from loads of weird queries :)
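To illustrate the first tier, here is a minimal sketch with the DataStax Java driver from Scala; the contact point, keyspace, and schema are hypothetical:

```scala
import com.datastax.driver.core.Cluster

// Hypothetical contact point and keyspace.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("analytics")

// "Covered" by Cassandra: the WHERE clause is restricted to the partition key,
// so this is a cheap single-partition read with no server-side aggregation.
val rs = session.execute(
  "SELECT device, status FROM events WHERE country = ?", "DE")
rs.forEach(row => println(row.getString("device")))

// Anything that needs GROUP BY-style counting across partitions falls through
// to Solr facets or, for rarer and offline questions, to Spark.
cluster.close()
```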

answered Sep 21 '22 by aleck


I would recommend Spark: if you take a look at the list of companies using it, you'll find such names as Amazon, eBay, and Yahoo!. Also, as you noted in the comment, it's becoming a mature tool.

You've given arguments against Cassandra and Solr already, so I'll focus on explaining why Hadoop MapReduce wouldn't do as well as Spark for real-time queries.

Hadoop and MapReduce were designed to leverage hard disks, under the assumption that for big data the IO cost is acceptable. As a result, data is read and written at least twice: once in the map stage and once in the reduce stage. This allows you to recover from failures, since partial results are persisted, but it's not what you want when aiming for real-time queries.

Spark not only aims to fix MapReduce's shortcomings, it also focuses on interactive data analysis, which is exactly what you want. This goal is achieved mainly by utilizing RAM, and the results are astonishing: Spark jobs are often 10-100 times faster than their MapReduce equivalents.

The only caveat is the amount of memory you have. Most probably your data is going to fit in the RAM you can provide, or you can rely on sampling. Usually, when working with data interactively, there is no real need for MapReduce, and that seems to be the case here.
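As a concrete sketch of your counting queries on top of Cassandra with the spark-cassandra-connector (keyspace, table, and column names are made up):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("realtime-counts")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // hypothetical contact point
val sc = new SparkContext(conf)

// GROUP BY country, device with an occurrence count, computed in memory.
val counts = sc.cassandraTable("analytics", "events")   // hypothetical keyspace/table
  .select("country", "device")                          // pull only the needed columns off disk
  .map(row => ((row.getString("country"), row.getString("device")), 1L))
  .reduceByKey(_ + _)

counts.take(20).foreach(println)
```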

answered Sep 25 '22 by GallantQuail