
SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:

my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')

How do I invoke SparkSQL on the data read above, so that I can run a query like:

'select col_A, col_B from my_table'
asked Dec 21 '16 by Edamame


People also ask

Can we read a Parquet file?

We can always read a Parquet file into a DataFrame in Spark and inspect its content. Parquet is a columnar format, better suited to analytical environments: write once, read many. Parquet files are well suited to read-intensive applications.
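For example, a minimal PySpark sketch along these lines (assuming a Spark 2.x SparkSession and reusing the question's HDFS path):

from pyspark.sql import SparkSession

# Assumption: Spark 2.x-style entry point; on Spark 1.x use sqlContext.read.parquet instead
spark = SparkSession.builder.appName("read-parquet").getOrCreate()

# Read the columnar Parquet data into a DataFrame and inspect it
df = spark.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
df.printSchema()
df.show(5)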

How do I read a Parquet file from HDFS spark?

Use the textFile() and wholeTextFiles() methods of the SparkContext to read files from any file system; to read from HDFS, pass the HDFS path as an argument to the function. The same applies when reading a text file from HDFS into a DataFrame.
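For illustration only, a rough PySpark sketch of those two calls (the HDFS paths below are hypothetical placeholders):

# Assumption: an existing SparkContext `sc` (e.g. spark.sparkContext); paths are placeholders
lines = sc.textFile('hdfs://my_hdfs_path/some_dir/data.txt')    # RDD of lines
files = sc.wholeTextFiles('hdfs://my_hdfs_path/some_dir/')      # RDD of (path, content) pairs
print(lines.count(), files.count())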

Can you use spark SQL to read a Parquet data?

Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
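A small sketch of that behaviour (the /tmp output path and sample rows are made up for the example):

# Assumption: a SparkSession named `spark`; the path and data are illustrative only
people = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
people.write.mode("overwrite").parquet("/tmp/people.parquet")

# Reading the files back recovers the column names and types (columns come back nullable)
restored = spark.read.parquet("/tmp/people.parquet")
restored.printSchema()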


2 Answers

After creating a DataFrame from a Parquet file, you have to register it as a temp table to run SQL queries on it.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")

df.printSchema

// after registering as a table you will be able to run sql queries
df.registerTempTable("people")

sqlContext.sql("select * from people").collect.foreach(println)
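Since the question uses PySpark, a rough Python equivalent (reusing the question's sqlContext and HDFS path; on Spark 2.x, createOrReplaceTempView replaces the deprecated registerTempTable):

# Assumption: the Spark 1.x sqlContext from the question
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
my_data.registerTempTable('my_table')   # use my_data.createOrReplaceTempView on Spark 2.x+
sqlContext.sql('select col_A, col_B from my_table').show()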
answered Sep 18 '22 by bob


With plain SQL

JSON, ORC, Parquet, and CSV files can be queried with plain SQL without first registering them as a table or creating a DataFrame.

import org.apache.spark.sql.SparkSession

// This is Spark 2.x code; you can do the same on sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate

spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
  .show()
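The same direct-path query in PySpark (assuming a Spark 2.x SparkSession named spark) would be:

# Assumption: a Spark 2.x+ SparkSession `spark`
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`").show()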
answered Sep 17 '22 by mrsrinivas