I am working on an application for "Real Time Rendering of Big Data (Spatial data)". With Spark Streaming + Spark SQL + WebSocket, I am able to render predefined queries on a dashboard, but I also want to fetch data with interactive and ad hoc queries.
For that purpose I am trying to implement it with "Spark Streaming + Cassandra". These queries require aggregation and filtering over a huge amount of data.
I am new to Cassandra and Spark, so I am unsure which of the approaches below will be better/faster:
Will Cassandra be fast enough to return results in real time? Or should I create an RDD from Cassandra and run the interactive queries over it?
One of the queries is:
"SELECT * FROM PERFORMANCE.GEONAMES A INNER JOIN
(SELECT max(GEONAMEID) AS MAPINFO_ID FROM PERFORMANCE.GEONAMES
where longitude between %LL_LONG% and %UR_LONG%
and latitude between %LL_LAT% and %UR_LAT%
and %WHERE_CLAUSE% GROUP BY LEFT(QUADKEY, %QUAD_TREE_LEVEL%) )
AS B ON A.GEONAMEID = B.MAPINFO_ID"
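For readers unfamiliar with the quadkey grouping, the intent of this query can be sketched in plain Python: keep the rows inside the bounding box, group them by the first %QUAD_TREE_LEVEL% characters of QUADKEY, and return the row with the largest GEONAMEID in each group. This is only an in-memory illustration, not Cassandra or Spark code. It assumes GEONAMEID is unique (so the join back to the table returns exactly one row per group), it leaves out %WHERE_CLAUSE%, and the function name and sample rows are hypothetical.

```python
def pick_representatives(rows, ll_long, ur_long, ll_lat, ur_lat, quad_tree_level):
    """Return one row per quadkey prefix: the row with the largest geonameid."""
    best = {}
    for row in rows:
        # Bounding-box filter (the BETWEEN clauses; bounds are inclusive).
        in_box = (ll_long <= row["longitude"] <= ur_long
                  and ll_lat <= row["latitude"] <= ur_lat)
        if not in_box:
            continue
        # Equivalent of LEFT(QUADKEY, %QUAD_TREE_LEVEL%).
        tile = row["quadkey"][:quad_tree_level]
        # Equivalent of max(GEONAMEID) per group, keeping the whole row.
        if tile not in best or row["geonameid"] > best[tile]["geonameid"]:
            best[tile] = row
    return sorted(best.values(), key=lambda r: r["geonameid"])


# Hypothetical sample data, for illustration only.
sample_rows = [
    {"geonameid": 1, "longitude": 10.0, "latitude": 50.0, "quadkey": "1202"},
    {"geonameid": 2, "longitude": 10.1, "latitude": 50.1, "quadkey": "1203"},
    {"geonameid": 3, "longitude": 11.0, "latitude": 51.0, "quadkey": "1300"},
    {"geonameid": 4, "longitude": 99.0, "latitude": 51.0, "quadkey": "1301"},  # outside the box
]

result = pick_representatives(sample_rows, 9.0, 12.0, 49.0, 52.0, quad_tree_level=2)
print([r["geonameid"] for r in result])
```

A lower quad tree level gives shorter prefixes, so fewer and larger tiles and fewer rows to render.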
Any inputs or suggestions will be appreciated. Thanks,
Thanks @doanduyhai for suggesting the SASI secondary index; it really made a huge difference.
Will Cassandra be fast enough to return results in real time? Or should I create an RDD from Cassandra and run the interactive queries over it?
It depends on how much filtering you do up front and on the number of machines in your cluster. If your Cassandra table holds 1 TB of data and your query fetches 100 GB into memory, then on a cluster of 10 machines each machine has to load about 10 GB. That is manageable, but the query will never be sub-minute.
Now, if you filter enough to fetch only 100 MB in total from the Cassandra table, that is 10 MB per machine, and latency on the order of seconds becomes possible.
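The per-machine figures are just the fetched volume divided by the number of machines. A minimal sketch of that arithmetic, assuming the data is spread evenly across the nodes (real clusters can be skewed) and using a hypothetical helper name:

```python
GB = 10**9
MB = 10**6


def per_node_bytes(total_fetched_bytes, node_count):
    """Bytes each node must load, assuming an even spread across nodes."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return total_fetched_bytes / node_count


# The two scenarios from the text, on a 10-machine cluster.
heavy = per_node_bytes(100 * GB, 10)   # weak filtering: 100 GB fetched in total
light = per_node_bytes(100 * MB, 10)   # strong filtering: 100 MB fetched in total

print(f"weak filter:   {heavy / GB:.0f} GB per machine")
print(f"strong filter: {light / MB:.0f} MB per machine")
```

The thousandfold gap between the two cases is why filtering early matters more than adding machines.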
How to filter data early in Cassandra?