I have registered a temporary table with Spark SQL, as described in [this section]:
people.registerTempTable("people")
// I can run queries on it all right.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Now I want to access this table remotely through JDBC. I start up the Thrift Server as described in [this other section].
./sbin/start-thriftserver.sh --master spark://same-master-as-above:7077
But the table is not visible.
0: jdbc:hive2://localhost:10000> show tables;
+---------+
| result |
+---------+
+---------+
No rows selected (2.216 seconds)
I guess this is because the table is "temporary" (i.e. tied to the lifetime of the SqlContext object). But how do I make non-temporary tables?
I can see Hive tables through the Thrift Server, but I don't see how I could expose an RDD like this. I've found a comment that suggests I cannot.
Or should I run the Thrift Server in my application with my own SqlContext? Almost all classes around it are private, and this code is not in Maven Central (as far as I see). Am I supposed to use HiveThriftServer2.startWithContext? It's undocumented and @DeveloperApi, but might work.
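A minimal sketch of what I have in mind, assuming the thrift-server classes are on the classpath (e.g. in spark-shell or a build with -Phive-thriftserver); the sample data and app name are made up:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

case class Person(name: String, age: Int)

val sc = new SparkContext(new SparkConf().setAppName("thrift-with-context"))
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Register (and optionally cache) the temp table on the same context the server will share
val people = sc.parallelize(Seq(Person("Alice", 15), Person("Bob", 30))).toDF()
people.registerTempTable("people")
hiveContext.cacheTable("people")

// @DeveloperApi: exposes this context's temp tables over JDBC (port 10000 by default)
HiveThriftServer2.startWithContext(hiveContext)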
Most probably you have already solved this problem, but I tried a similar use case recently and wanted to share my findings. To expose Spark data over JDBC, you need to do the following:
Start the thrift-server that comes with Spark (I used version 1.3.1). The thrift-server that ships with Hive might also work, but I haven't tested that:
/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001
Ensure you don't give "localhost" but the actual IP address/DNS name of the server to "hive.server2.thrift.bind.host"; otherwise you can't connect to this thrift-server from a different host.
SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://spark-master:10001
You can use any HiveQL here to load the data you want into a table, but the SQL will be executed in the Spark cluster:
create temporary table tweets using org.apache.spark.sql.parquet options (path 'maprfs:///path-to-your-parquet-files/');
Cache the table in Spark memory:
cache table tweets
That's it, all your data is now cached in the Spark cluster and you can query it with low latency over remote JDBC.
The JDBC coding is exactly the same as writing any Hive JDBC code, because the thrift-server understands HiveQL and translates it to Spark SQL behind the scenes. Look here for Hive JDBC examples:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLs
For the username, you can give any Linux username on the thrift-server machine and leave the password blank (this is non-secured mode; you can make things more secure if needed).
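As a rough example, a bare-bones JDBC client in Scala could look like this (the driver class and URL format are the standard HiveServer2 ones from the link above; the username is just a placeholder for a Linux user on the thrift-server machine):
import java.sql.DriverManager

object ThriftServerClient {
  def main(args: Array[String]): Unit = {
    // Needs the hive-jdbc driver (and its dependencies) on the classpath
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Host/port from the start-thriftserver.sh command above; blank password in non-secured mode
    val conn = DriverManager.getConnection("jdbc:hive2://spark-master:10001", "mapr", "")
    try {
      val stmt = conn.createStatement()
      // Executed by the Spark cluster against the cached "tweets" table created via beeline
      val rs = stmt.executeQuery("SELECT count(*) FROM tweets")
      while (rs.next()) {
        println("tweet count: " + rs.getLong(1))
      }
    } finally {
      conn.close()
    }
  }
}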
the table "tweets" will not be visible with "show tables" etc because the table is not registered in the hive metastore but you can do all sql queries with the table over jdbc
You can use Spark to cache data in memory and query it over remote JDBC/ODBC. This will yield low-latency responses; I got responses under 2 seconds for a data size of about 40 GB. But the caching will not work for really big data, like terabytes. In that case, avoid the "cache table ..." command; you can still use the remote JDBC/ODBC, but responses will not be in the low-latency range.
All the best!
MK