Read ORC files directly from Spark shell

I am having trouble reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2 and using the pyspark shell, but I can also use spark-shell (which runs Scala).

I have been following this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])

I get an error that generally says the syntax is wrong. Once, the code seemed to work when I passed only the first of the three arguments to hadoopFile, but when I tried to use

inputRead.first()

the output was RDD[nothing, nothing]. I don't know whether this is because the inputRead variable was not created as an RDD or because it was not created at all.

I appreciate any help!

asked Jun 11 '15 22:06

mslick3

2 Answers

In Spark 1.5, I'm able to load my ORC file as:

val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
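
Since the question uses the pyspark shell, here is a rough Python sketch of the same DataFrameReader call. The helper name load_orc is hypothetical and the path is the placeholder from the snippet above. It assumes Spark 1.4 or later (where the context exposes .read) and, on Spark 1.x, a Hive-enabled build, since ORC support there went through HiveContext. It will not work on Spark 1.2 as described in the question.

```python
def load_orc(ctx, path):
    """Load an ORC file as a DataFrame through the DataFrameReader API.

    `ctx` is assumed to be a HiveContext (Spark 1.x) or any object
    exposing the same `.read` attribute; `path` is an HDFS or local URI.
    """
    return ctx.read.format("orc").load(path)


# Hypothetical usage inside the pyspark shell, where `sc` already exists:
#
#   from pyspark.sql import HiveContext
#   hiveCtx = HiveContext(sc)
#   df = load_orc(hiveCtx, "hdfs:///ORC_FILE_PATH")
#   df.show()
```

The helper only wraps the reader call, so the same line can be typed directly in the shell.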

answered Oct 06 '22 00:10

Sudheer Palyam

You can try this code; it works for me.

val LoadOrc = spark.read.option("inferSchema", true).orc("filepath")
LoadOrc.show()
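
For pyspark users, a possible Python counterpart is sketched below. It assumes Spark 2.0 or later, where SparkSession exists. The function name read_orc and the application name are made up for illustration. The inferSchema option is left out on the assumption that it is not needed here, because ORC files carry their schema in the file itself.

```python
def read_orc(path, spark=None):
    """Read an ORC file into a DataFrame using a SparkSession.

    If no session is passed in, one is created (or reused) lazily;
    this branch requires pyspark to be installed.
    """
    if spark is None:
        from pyspark.sql import SparkSession  # imported only when needed
        spark = SparkSession.builder.appName("read-orc-example").getOrCreate()
    return spark.read.orc(path)


# Hypothetical usage in the pyspark shell (Spark 2.x+), where `spark` exists:
#
#   df = read_orc("filepath", spark)
#   df.show()
```

Passing the existing shell session avoids creating a second one; the fallback branch is only a convenience for standalone scripts.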

answered Oct 06 '22 00:10

Suman M

Tags: scala, apache-spark, hadoop, pyspark, hive