 

How to use Apache spark as Query Engine?

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat-file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.

I have to run the job again and again for new search terms, and every time I have to submit the jar via spark-submit to get the results. Since the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame for every query.

So what I need is:

  • a way to load the data into a DataFrame once and query it multiple times without submitting the jar each time
  • whether Spark can be used as a search engine / query engine
  • whether the DataFrame can be queried remotely via a REST API

> The current configuration of my Spark deployment is:

  • 5-node cluster
  • runs on YARN as the resource manager

I have tried spark-jobserver, but it also re-runs the job every time.

Asked Oct 31 '25 by PradhanKamal

1 Answer

You might be interested in the HiveThriftServer2 and Spark integration.

Basically, you start a Hive Thrift Server and inject a HiveContext built from your SparkContext:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// sc is your existing SparkContext
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10001")

// Load the data once, then expose the DataFrame as a table
dataFrame.registerTempTable("myTable")

// Start the Thrift server backed by this context
HiveThriftServer2.startWithContext(sql)

There are several client libraries and tools to query the server: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

including the CLI tool beeline.
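For example, once the server is running you can open an interactive SQL session against the registered table without resubmitting any jar. This is a sketch assuming the server runs on localhost with the port 10001 configured above, no authentication, and the hypothetical table name `myTable` from the snippet:

```
# Connect beeline to the Thrift server started from the Spark job
beeline -u jdbc:hive2://localhost:10001

# At the beeline prompt, query the cached DataFrame with plain SQL:
# 0: jdbc:hive2://localhost:10001> SELECT * FROM myTable WHERE name = 'foo';
```

Because the DataFrame stays registered in the long-running Spark application, each new query reuses the already-loaded data instead of reloading the 40.5 GB from source.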

Reference: https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr

Answered Nov 03 '25 by Piotr Reszke


