 

Spark: Save Dataframe in ORC format

In the previous version, we used to have a 'saveAsOrcFile()' method on RDD. This is now gone! How do I save the data in a DataFrame in ORC file format?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def main(args: Array[String]) {
  println("Creating Orc File!")
  val sparkConf = new SparkConf().setAppName("orcfile")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val people = sc.textFile("/apps/testdata/people.txt")
  val schemaString = "name age"

  // Build the schema: "name" is a string, "age" an integer; both nullable.
  val schema = StructType(schemaString.split(" ").map { fieldName =>
    if (fieldName == "name") StructField(fieldName, StringType, true)
    else StructField(fieldName, IntegerType, true)
  })

  // Parse each comma-separated line into a Row that matches the schema.
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))

  // Apply the schema to the RDD to get a DataFrame.
  val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)

  // Register a temp table so the DataFrame can be queried with SQL.
  peopleSchemaRDD.registerTempTable("people")
  val results = hiveContext.sql("SELECT * FROM people")
  results.map(t => "Name: " + t.toString).collect().foreach(println)

  // Now I want to save this DataFrame (peopleSchemaRDD) in ORC format. How do I do that?
}

asked Sep 16 '15 by DilTeam

People also ask


Does Spark support the ORC file format?

Spark on HDP supports the Optimized Row Columnar ("ORC") file format, a self-describing, type-aware column-based file format that is one of the primary file formats supported in Apache Hive. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query.
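
To see that column pruning in action, here is a minimal sketch. It assumes the ORC output was written to a hypothetical "people" path and a Spark version (1.5+) where DataFrameReader.orc is available; on 1.4, read.format("orc").load(...) does the same thing.

// Read the ORC dataset back and project a single column.
// Only the "name" column is read and decompressed; "age" is skipped,
// which is exactly the benefit of a columnar format like ORC.
val orcDF = hiveContext.read.orc("people")
orcDF.select("name").show()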

Which is better, Parquet or ORC?

Parquet is more capable of storing nested data, whereas ORC is more capable of predicate pushdown, supports ACID properties, and is more compression-efficient.
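
As an illustration of predicate pushdown with ORC, here is a hedged sketch: spark.sql.orc.filterPushdown is a real Spark SQL setting (disabled by default in Spark 1.x), and the "people" path is the hypothetical ORC output from above.

// Turn on ORC predicate pushdown so filters are evaluated inside the reader.
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// Stripes whose min/max statistics rule out age > 30 can be skipped entirely.
hiveContext.read.orc("people").filter("age > 30").show()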

What is the best format for spark storage?

The default file format for Spark is Parquet, but there are use cases where other formats are better suited, including SequenceFiles: binary key/value pairs that are a good choice for blob storage when the overhead of rich schema support is not required. A small sketch of that route follows.
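
Here is the SequenceFile sketch, using the core RDD API rather than DataFrames; the /tmp/people_seq path is a placeholder.

// Pair RDDs of Writable-convertible types can be saved as SequenceFiles.
val pairs = sc.parallelize(Seq(("alice", 29), ("bob", 31)))
pairs.saveAsSequenceFile("/tmp/people_seq")

// Reading back requires stating the key/value types explicitly.
val restored = sc.sequenceFile[String, Int]("/tmp/people_seq")
restored.collect().foreach(println)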


1 Answer

Since Spark 1.4 you can simply use DataFrameWriter and set the format to orc:

peopleSchemaRDD.write.format("orc").save("people")

or

peopleSchemaRDD.write.orc("people")
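
To verify the write, the output can be read back the same way (a minimal sketch; "people" is the path used above, and read.orc assumes Spark 1.5+, while read.format("orc").load(...) works on 1.4):

// Load the ORC output back into a DataFrame and inspect it.
val loaded = hiveContext.read.orc("people")
loaded.printSchema()
loaded.show()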
answered Sep 22 '22 by zero323