
Read ORC files directly from Spark shell

I am having trouble reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2 and using the pyspark shell, but I can also use spark-shell (which runs Scala).

I have used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])

I get an error that generally says the syntax is wrong. Once, the code seemed to work when I passed only the first of the three arguments to hadoopFile, but when I tried to use

inputRead.first()

the output was RDD[nothing, nothing]. I don't know whether this is because the inputRead variable was not created as an RDD, or because it was not created at all.

I appreciate any help!

asked Jun 11 '15 22:06 by mslick3


People also ask

How do I read an ORC file in Spark?

Use the orc() method of Spark's DataFrameReader to read an ORC file into a DataFrame. It can read files written with snappy, zlib, or no compression, and you do not need to specify a compression option when reading an ORC file.

Does spark support ORC file format?

Spark on HDP supports the Optimized Row Columnar ("ORC") file format, a self-describing, type-aware column-based file format that is one of the primary file formats supported in Apache Hive. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query.
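
As a rough illustration of the column pruning described above, here is a minimal PySpark sketch. The helper name read_orc_columns and the column names in the usage comment are hypothetical, and it assumes a SparkSession (Spark 2.0 or later) is already available as `spark`.

```python
def read_orc_columns(spark, path, columns):
    """Read an ORC source and keep only the requested columns.

    Selecting columns right after the read gives Spark the chance to
    skip the other columns when it scans the columnar file.
    """
    return spark.read.orc(path).select(*columns)


# In the pyspark shell (column names below are placeholders):
#   df = read_orc_columns(spark, "hdfs:///ORC_FILE_PATH", ["id", "name"])
#   df.show()
```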

How do I view ORC files?

To read ORC files, use the OrcFile class to create a Reader that contains the metadata about the file. There are a few options to the ORC reader, but far fewer than the writer and none of them are required. The reader has methods for getting the number of rows, schema, compression, etc. from the file.


2 Answers

In Spark 1.5, I'm able to load my ORC file as:

val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
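
Since the question mentions the pyspark shell, a rough Python equivalent of the Scala above might look like the sketch below. The helper name load_orc is hypothetical. It assumes Spark 1.4 or later, where the context exposes `read`, and it assumes the context is a HiveContext, because in the 1.x line ORC support is part of the Hive module.

```python
def load_orc(sql_context, path):
    """Load an ORC file or directory into a DataFrame via the generic reader.

    `sql_context` is assumed to be a HiveContext (Spark 1.4+); `path` is
    whatever HDFS or local path holds the ORC data.
    """
    return sql_context.read.format("orc").load(path)


# In the pyspark shell, where `sc` already exists:
#   from pyspark.sql import HiveContext
#   df = load_orc(HiveContext(sc), "hdfs:///ORC_FILE_PATH")
#   df.show()
```

This call chain is not available on the Spark 1.2 setup described in the question, so it would only apply after an upgrade.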
answered Oct 06 '22 00:10 by Sudheer Palyam


You can try this code; it works for me.

val LoadOrc = spark.read.option("inferSchema", true).orc("filepath")
LoadOrc.show()
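
For the pyspark shell, the same read could be sketched as follows. The helper name load_orc_with_session is hypothetical, and it assumes a SparkSession (Spark 2.0 or later) is available as `spark`. The inferSchema option is left out here on the assumption that it is not needed: ORC files store their own schema, so the reader picks it up from the file.

```python
def load_orc_with_session(spark, path):
    """Read ORC data into a DataFrame using DataFrameReader.orc().

    No schema option is passed because ORC is self-describing; the
    column names and types come from the file itself.
    """
    return spark.read.orc(path)


# In the pyspark shell:
#   df = load_orc_with_session(spark, "filepath")
#   df.printSchema()
#   df.show()
```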
answered Oct 06 '22 00:10 by Suman M