
Accessing Spark SQL RDD tables through the Thrift Server

I have registered a temporary table with Spark SQL, as described in [this section]:

people.registerTempTable("people")
// I can run queries on it all right.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Now I want to access this table remotely through JDBC. I start up the Thrift Server as described in [this other section].

./sbin/start-thriftserver.sh --master spark://same-master-as-above:7077

But the table is not visible.

0: jdbc:hive2://localhost:10000> show tables;         
+---------+
| result  |
+---------+
+---------+
No rows selected (2.216 seconds)

I guess this is because the table is "temporary" (i.e. tied to the lifetime of the SQLContext object). But how do I make non-temporary tables?

I can see Hive tables through the Thrift Server, but I don't see how I could expose an RDD like this. I've found a comment that suggests I cannot.

Or should I run the Thrift Server in my application with my own SQLContext? Almost all classes around it are private, and this code is not in Maven Central (as far as I see). Am I supposed to use HiveThriftServer2.startWithContext? It's undocumented and @DeveloperApi, but might work.
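Something like this is what I have in mind — an untested sketch against the Spark 1.x @DeveloperApi, where the server would share the application's own context so temporary tables stay visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("thrift-demo"))
val hiveContext = new HiveContext(sc) // startWithContext requires a HiveContext

// register the temp table on this same context, as before:
// people.registerTempTable("people")

// port is read from the usual Hive config key
hiveContext.setConf("hive.server2.thrift.port", "10001")
HiveThriftServer2.startWithContext(hiveContext) // serves this context over JDBC
```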

Daniel Darabos asked Nov 24 '14 15:11

1 Answer

Most probably you have already solved this problem. But I tried a similar use case recently and wanted to share my findings. To expose Spark data over JDBC, you need to do the following:

  • Start the thrift server that comes with Spark (I used version 1.3.1); the thrift server that ships with Hive might work as well, but I haven't tested that

    /opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001

Ensure you don't give "localhost" but the actual IP address/DNS name of the server to "hive.server2.thrift.bind.host"; otherwise you can't connect to this thrift server from a different host.

  • beeline is a simple JDBC client that comes with Hive & Spark. Start beeline and connect it to your thrift server

$SPARK_HOME/bin/beeline

beeline> !connect jdbc:hive2://spark-master:10001

  • you can use any HiveQL here to load the data you want into a table, but the SQL will be executed on the Spark cluster

    create temporary table tweets using org.apache.spark.sql.parquet options (path 'maprfs:///path-to-your-parquet-files/');

  • cache the table in Spark memory

cache table tweets;

  • that's it; all your data is now cached in the Spark cluster, and you can query it with low latency over remote JDBC

  • the JDBC coding is exactly the same as writing any Hive JDBC code, because the thrift server understands HiveQL and translates it to Spark SQL behind the scenes. Look here for Hive JDBC examples:

https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLs

For the username, you can give any Linux username on the thrift-server machine and leave the password blank (this is non-secured mode; you can make things more secure if needed)

  • the table "tweets" will not be visible with "show tables" etc. because it is not registered in the Hive metastore, but you can run any SQL query against it over JDBC

  • you can use Spark to cache data in memory and query it over remote JDBC/ODBC. This yields low-latency responses; I got responses under 2 seconds for a data size of about 40 GB. But the caching will not work for really big data, like terabytes. In that case, avoid the "cache table ..." command; you can still use the remote JDBC/ODBC, but responses will not be in the low-latency range
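To make the Hive JDBC pattern from the link concrete, here is a minimal Scala sketch. The host, port, table name, and username ("mapr") match the assumptions in the steps above; it presumes the Hive JDBC driver (org.apache.hive:hive-jdbc) and its dependencies are on the classpath:

```scala
import java.sql.DriverManager

object TweetsClient {
  def main(args: Array[String]): Unit = {
    // The standard Hive JDBC driver; the Spark thrift server speaks the same protocol.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Any Linux username on the thrift-server machine; blank password (non-secured mode).
    val conn = DriverManager.getConnection(
      "jdbc:hive2://spark-master:10001/default", "mapr", "")
    try {
      val stmt = conn.createStatement()
      // Executed on the Spark cluster; served from memory if "cache table" was run.
      val rs = stmt.executeQuery("SELECT count(*) FROM tweets")
      while (rs.next()) println(rs.getLong(1))
    } finally {
      conn.close()
    }
  }
}
```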

All the best!

MK

M.K answered Nov 09 '22 07:11