How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?

Several postings on Stack Overflow have partial answers about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information about how to do that:

  1. In the Spark SQL app, do we need to use HiveContext to register tables? Or can we use just SQLContext?

  2. Where and how do we use HiveThriftServer2.startWithContext?

  3. When we run start-thriftserver.sh as in

    /opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001

besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?

  4. Are there any other things we need to do?

Thanks.

Michael asked Jul 18 '15
People also ask

Does Spark SQL use RDD?

Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using Spark. At the core of this component is a new type of RDD, SchemaRDD. SchemaRDDs are composed of Row objects along with a schema that describes the data type of each column in the row.
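As a quick illustration (a minimal sketch, assuming an existing sqlContext and a registered table named people, which are made up for this example; note that in Spark 1.3 SchemaRDD was renamed to DataFrame):

    // Querying a registered table yields rows plus a schema
    val result = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")
    result.printSchema()                  // the schema describing each column's type
    result.collect().foreach(println)     // the underlying Row objects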

How do you convert an RDD into a DataFrame or Dataset?

Converting a Spark RDD to a DataFrame can be done using toDF(), or using createDataFrame() by transforming an RDD[Row] together with an explicit schema.
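A minimal sketch of both approaches, assuming a running SparkContext sc; the Person case class and the sample data are made up for illustration:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._        // brings toDF() into scope

    case class Person(name: String, age: Int)

    // 1) toDF() on an RDD of case classes: the schema is inferred
    val df1 = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()

    // 2) createDataFrame() on an RDD[Row] with an explicit schema
    val rowRdd = sc.parallelize(Seq(Row("Ann", 30), Row("Bob", 25)))
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val df2 = sqlContext.createDataFrame(rowRdd, schema)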

Can Spark SQL read data from other databases?

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
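A minimal sketch of such a JDBC read (this uses the DataFrameReader API available from Spark 1.4 on; the MySQL URL, table name, and credentials are placeholders, and the JDBC driver jar must be on the classpath):

    // Load an external database table as a DataFrame over JDBC
    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")
      .option("dbtable", "orders")
      .option("user", "username")
      .option("password", "password")
      .load()

    // The result is a regular DataFrame: queryable and joinable with other sources
    jdbcDF.registerTempTable("orders")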


1 Answer

To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you write and run a simple application yourself; there is no need to run start-thriftserver.sh.

To your questions:

  1. HiveContext is needed; in spark-shell, the pre-built sqlContext is already a HiveContext (when Spark is built with Hive support).

  2. Write a simple application, for example:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver._

    val hiveContext = new HiveContext(sparkContext)
    // Register a Parquet file as a temp table visible to JDBC clients
    hiveContext.parquetFile(path).registerTempTable("my_table1")
    // Start the Thrift JDBC/ODBC server from within the application
    HiveThriftServer2.startWithContext(hiveContext)
  3. There is no need to run start-thriftserver.sh; run your own application instead, e.g.:

spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar

  4. Nothing else is needed on the server side; it should start on the default port 10000. You can verify by connecting to the server with beeline.
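For example, a quick check with beeline (a minimal sketch, assuming the server runs on spark-master with the default port; adjust the host and port to match your --hiveconf settings):

    beeline -u jdbc:hive2://spark-master:10000
    show tables;                        -- my_table1 should be listed
    select * from my_table1 limit 10;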
Haiying Wang answered Sep 20 '22