How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?

Several postings on Stack Overflow have partial answers about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information about how to do that:

  1. In the Spark SQL app, do we need to use HiveContext to register tables? Or can we use just SQLContext?

  2. Where and how do we use HiveThriftServer2.startWithContext?

  3. When we run start-thriftserver.sh as in

    /opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001

besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?

  4. Are there any other things we need to do?

Thanks.

Michael asked Jul 18 '15
People also ask

Does Spark SQL use RDD?

Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using Spark. At the core of this component is a new type of RDD, SchemaRDD. SchemaRDDs are composed of Row objects along with a schema that describes the data type of each column in the row.
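As a quick illustration (a minimal sketch, assuming an existing sqlContext and a registered table named people, which are made up for this example; note that in Spark 1.3 SchemaRDD was renamed to DataFrame):

    // Querying a registered table yields rows plus a schema
    val result = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")
    result.printSchema()                  // the schema describing each column's type
    result.collect().foreach(println)     // the underlying Row objects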

How do you convert an RDD into a DataFrame or Dataset?

Converting a Spark RDD to a DataFrame can be done using toDF(), or using createDataFrame() by transforming an RDD[Row] together with an explicit schema.
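A minimal sketch of both approaches, assuming a running SparkContext sc; the Person case class and the sample data are made up for illustration:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._        // brings toDF() into scope

    case class Person(name: String, age: Int)

    // 1) toDF() on an RDD of case classes: the schema is inferred
    val df1 = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()

    // 2) createDataFrame() on an RDD[Row] with an explicit schema
    val rowRdd = sc.parallelize(Seq(Row("Ann", 30), Row("Bob", 25)))
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val df2 = sqlContext.createDataFrame(rowRdd, schema)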

Can Spark SQL read data from other databases?

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
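A minimal sketch of such a JDBC read (this uses the DataFrameReader API available from Spark 1.4 on; the MySQL URL, table name, and credentials are placeholders, and the JDBC driver jar must be on the classpath):

    // Load an external database table as a DataFrame over JDBC
    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")
      .option("dbtable", "orders")
      .option("user", "username")
      .option("password", "password")
      .load()

    // The result is a regular DataFrame: queryable and joinable with other sources
    jdbcDF.registerTempTable("orders")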


1 Answer

To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you write and run a simple application yourself; there is no need to run start-thriftserver.sh.

To your questions:

  1. HiveContext is needed; in spark-shell, the pre-built sqlContext is already a HiveContext (when Spark is built with Hive support).

  2. Write a simple application, for example:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver._

    val hiveContext = new HiveContext(sparkContext)
    // Register a Parquet file as a temp table visible to JDBC clients
    hiveContext.parquetFile(path).registerTempTable("my_table1")
    // Start the Thrift JDBC/ODBC server from within the application
    HiveThriftServer2.startWithContext(hiveContext)
  3. There is no need to run start-thriftserver.sh; run your own application instead, e.g.:

spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar

  4. Nothing else is needed on the server side; it should start on the default port 10000. You can verify by connecting to the server with beeline.
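For example, a quick check with beeline (a minimal sketch, assuming the server runs on spark-master with the default port; adjust the host and port to match your --hiveconf settings):

    beeline -u jdbc:hive2://spark-master:10000
    show tables;                        -- my_table1 should be listed
    select * from my_table1 limit 10;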
Haiying Wang answered Sep 20 '22