Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Phoenix vs Hive-Spark

What's faster/easier to convert into SQL, that accept SQL scripts as input: Spark SQL which comes as a layer of speed for Hive high latency queries or Phoenix? And if so, how? I need to do a lot of upserts/joining/grouping over the data. [hbase]

Is there any alternative on top of Cassandra CQL to support the above mentioned(joining/grouping in a real-time manner)?

I'm most probably bound to Spark since I would like to take advantage of MLlib. But for processing the data which should be my option to go?

Thanks, kraster

like image 940
kraster Avatar asked Apr 02 '15 05:04

kraster


2 Answers

http://phoenix-hbase.blogspot.com/ I more then sure that Phoenix on Hbase will work faster.

here is sample query and PC requirement for test Query: select count(1) from table over 10M and 100M rows. Data is 5 narrow columns. Number of Region Servers: 4 (HBase heap: 10GB, Processor: 6 cores @ 3.3GHz Xeon) enter image description here Because Phoenix use HBASE client interface to load all query, and use query engine only to map sql task for map reduce task in HBase

like image 57
Grigoriev Nick Avatar answered Oct 01 '22 05:10

Grigoriev Nick


You have a few options (from my knowledge)

  1. Apache phoenix is a good choice for low-latency and mid-size table (1M - 100M rows, but beware of tables with many columns!) processing. A great plus for phoenix is that its really easy to get started. My company already had an HBase cluster set up (with kerberos). To use Phoenix, all I needed was the HMaster URL, Hbase-site.xml and a keytab to get the operation going. Very quick reads and writes are decent (its slower for me because I needed to do it dynamically so I was force to use the Java client API instead of bulk loading)

  2. Hive with Spark is great as well. I am not sure how great of a performance is it over Phoenix. Since Spark does most things in-memory, I am assuming it should be quick. However, I can tell you if you want to expose the SQL accessing as some sort of API, using spark becomes quite difficult.

  3. Presto is a great product that offers Spark-like processing power with SQL interface that allows you to inter-connect data from many sources (Hive, Cassandra, MySQL ..etc)

Hope this helps.

like image 33
dodo Avatar answered Oct 01 '22 04:10

dodo