 

Cassandra + Spark for Real time analytics

I am working on an application for "Real Time Rendering of Big Data (Spatial data)". With the help of Spark Streaming + Spark SQL + WebSocket, I am able to render predefined queries on a dashboard. But I also want to fetch data with interactive and ad hoc queries.

For that purpose I am trying to implement it with "Spark Streaming + Cassandra". These queries require aggregation and filtering over a huge amount of data.

I am new to Cassandra and Spark, so I am confused about which of the approaches below will be better/faster:

  1. Spark Streaming -> Filtering (Spark) -> Save to Cassandra -> Interactive Query -> UI (Dashboard)
  2. Spark Streaming -> Filtering (Spark) -> Save to Cassandra -> Spark SQL -> Interactive Query -> UI (Dashboard)

Will Cassandra be fast enough to give results in real time? Or should I create an RDD from Cassandra to perform interactive queries over it?

One of the queries is:

"SELECT *  FROM PERFORMANCE.GEONAMES A  INNER JOIN  
(SELECT max(GEONAMEID) AS MAPINFO_ID FROM  PERFORMANCE.GEONAMES
where longitude between %LL_LONG% and %UR_LONG% 
and latitude between %LL_LAT% and %UR_LAT%  
and %WHERE_CLAUSE% GROUP BY LEFT(QUADKEY, %QUAD_TREE_LEVEL%)  )
AS B ON A.GEONAMEID = B.MAPINFO_ID"
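
To make the intent of this query concrete, here is a rough plain-Python sketch of the same logic: keep the rows inside the bounding box, group them by the first `level` characters of the quadkey, and return the row with the largest `geonameid` in each group. It assumes `geonameid` is unique (so the join back on the maximum ID yields exactly one row per group) and it leaves out the `%WHERE_CLAUSE%` placeholder. The function name and the sample rows, including their quadkeys, are made up for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def representative_rows(rows: Iterable[Row],
                        ll_long: float, ur_long: float,
                        ll_lat: float, ur_lat: float,
                        level: int) -> List[Row]:
    """Return one row per quadkey prefix: the one with the largest geonameid."""
    best: Dict[str, Row] = {}
    for row in rows:
        # Bounding-box filter (the two BETWEEN predicates of the query).
        if not (ll_long <= row["longitude"] <= ur_long):
            continue
        if not (ll_lat <= row["latitude"] <= ur_lat):
            continue
        # GROUP BY LEFT(QUADKEY, level)
        prefix = row["quadkey"][:level]
        current = best.get(prefix)
        # max(GEONAMEID) per group; keeping the whole row stands in for the join.
        if current is None or row["geonameid"] > current["geonameid"]:
            best[prefix] = row
    return sorted(best.values(), key=lambda r: r["geonameid"])


# Hypothetical sample data; the quadkeys are not geographically accurate.
sample = [
    {"geonameid": 1, "longitude": 10.0, "latitude": 50.0, "quadkey": "1202"},
    {"geonameid": 2, "longitude": 10.1, "latitude": 50.1, "quadkey": "1203"},
    {"geonameid": 3, "longitude": 11.0, "latitude": 51.0, "quadkey": "1210"},
    {"geonameid": 4, "longitude": 40.0, "latitude": 10.0, "quadkey": "1230"},
]

result = representative_rows(sample, 9.0, 12.0, 49.0, 52.0, level=3)
print([r["geonameid"] for r in result])
```

A lower `level` gives coarser tiles and therefore fewer points on the map, which is presumably why the query takes `%QUAD_TREE_LEVEL%` as a parameter.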

Any input or suggestions would be appreciated. Thanks,

Thanks @doanduyhai for suggesting the SASI secondary index; it really made a huge difference.

Ajeet asked Oct 19 '22 11:10


1 Answer

Will Cassandra be fast enough to give results in real time? Or should I create an RDD from Cassandra to perform interactive queries over it?

It depends on how much filtering you do up front and on the number of machines in your cluster. If your Cassandra table holds 1 TB of data and your query fetches 100 GB of it into memory, then on a cluster of 10 machines each machine has to load about 10 GB. That is manageable, but the query will never be sub-minute.

Now, if you filter enough to fetch only 100 MB in total out of the Cassandra table, that is 10 MB per machine, and latency on the order of seconds becomes possible.

How can you filter data early in Cassandra?
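
As a back-of-the-envelope check of the two scenarios, the small sketch below divides the fetched volume evenly across the cluster. The even split and the function name are assumptions made for illustration; real clusters are rarely perfectly balanced.

```python
def per_machine_mb(fetched_mb: float, machines: int) -> float:
    """Data each machine must load, assuming an even split across the cluster."""
    if machines <= 0:
        raise ValueError("machines must be positive")
    return fetched_mb / machines


# Scenario 1: the query fetches 100 GB (100,000 MB) on 10 machines.
heavy = per_machine_mb(100_000, 10)
# Scenario 2: early filtering cuts the fetch to 100 MB on 10 machines.
light = per_machine_mb(100, 10)

print(f"heavy: {heavy:.0f} MB per machine, light: {light:.0f} MB per machine")
```

The three orders of magnitude between the two results are what separate a query that takes minutes from one that can answer in seconds.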

  1. Use the new SASI secondary index (wait for Cassandra 3.5, due out this week, because 2 critical bugs have been discovered)
  2. Use DSE Search to filter early with Solr
  3. Use Stratio Lucene secondary index
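
For the first option, a minimal sketch of what the CQL could look like for the bounding-box part of the question's query, built here as plain Python strings so it can be inspected without a running cluster. The keyspace, table and column names come from the question; the index names are hypothetical, and the default SASI mode is assumed to be sufficient for numeric range predicates. As far as I know, CQL asks for `ALLOW FILTERING` when a query combines several SASI predicates, so treat that clause as an assumption to verify against your Cassandra version.

```python
KEYSPACE = "performance"
TABLE = "geonames"


def sasi_index_cql(column: str) -> str:
    """CQL that creates a SASI index on one column (index name is hypothetical)."""
    return (
        f"CREATE CUSTOM INDEX {TABLE}_{column}_sasi "
        f"ON {KEYSPACE}.{TABLE} ({column}) "
        "USING 'org.apache.cassandra.index.sasi.SASIIndex'"
    )


def bounding_box_cql() -> str:
    """Parameterised range query that the SASI indexes are meant to serve."""
    return (
        f"SELECT * FROM {KEYSPACE}.{TABLE} "
        "WHERE longitude >= ? AND longitude <= ? "
        "AND latitude >= ? AND latitude <= ? "
        "ALLOW FILTERING"
    )


statements = [sasi_index_cql("longitude"), sasi_index_cql("latitude")]
for stmt in statements + [bounding_box_cql()]:
    print(stmt)
```

The aggregation and join from the original query would still have to happen outside Cassandra (for example in Spark), but on a much smaller set of rows.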

doanduyhai answered Nov 13 '22 01:11