 

Spark: Save Dataframe in ORC format

In the previous version, we used to have a 'saveAsOrcFile()' method on RDD. This is now gone! How do I save the data in a DataFrame in ORC file format?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def main(args: Array[String]) {
  println("Creating Orc File!")
  val sparkConf = new SparkConf().setAppName("orcfile")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val people = sc.textFile("/apps/testdata/people.txt")
  val schemaString = "name age"

  // Build the schema: "name" is a string, "age" an integer; both nullable.
  val schema = StructType(schemaString.split(" ").map { fieldName =>
    if (fieldName == "name") StructField(fieldName, StringType, true)
    else StructField(fieldName, IntegerType, true)
  })

  // Parse each comma-separated line into a Row that matches the schema.
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))

  // Apply the schema to the RDD to get a DataFrame.
  val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)

  // Register a temp table so the DataFrame can be queried with SQL.
  peopleSchemaRDD.registerTempTable("people")
  val results = hiveContext.sql("SELECT * FROM people")
  results.map(t => "Name: " + t.toString).collect().foreach(println)

  // Now I want to save this DataFrame (peopleSchemaRDD) in ORC format. How do I do that?
}

asked Sep 16 '15 by DilTeam

People also ask


Does Spark support the ORC file format?

Spark on HDP supports the Optimized Row Columnar ("ORC") file format, a self-describing, type-aware column-based file format that is one of the primary file formats supported in Apache Hive. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query.
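
To see that column pruning in action, here is a minimal sketch. It assumes the ORC output was written to a hypothetical "people" path and a Spark version (1.5+) where DataFrameReader.orc is available; on 1.4, read.format("orc").load(...) does the same thing.

// Read the ORC dataset back and project a single column.
// Only the "name" column is read and decompressed; "age" is skipped,
// which is exactly the benefit of a columnar format like ORC.
val orcDF = hiveContext.read.orc("people")
orcDF.select("name").show()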

Which is better, Parquet or ORC?

Parquet is more capable of storing nested data, whereas ORC is more capable of predicate pushdown, supports ACID properties, and is more compression-efficient.
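
As an illustration of predicate pushdown with ORC, here is a hedged sketch: spark.sql.orc.filterPushdown is a real Spark SQL setting (disabled by default in Spark 1.x), and the "people" path is the hypothetical ORC output from above.

// Turn on ORC predicate pushdown so filters are evaluated inside the reader.
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// Stripes whose min/max statistics rule out age > 30 can be skipped entirely.
hiveContext.read.orc("people").filter("age > 30").show()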

What is the best format for spark storage?

The default file format for Spark is Parquet, but there are use cases where other formats are better suited, including SequenceFiles: binary key/value pairs that are a good choice for blob storage when the overhead of rich schema support is not required. A small sketch of that route follows.
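
Here is the SequenceFile sketch, using the core RDD API rather than DataFrames; the /tmp/people_seq path is a placeholder.

// Pair RDDs of Writable-convertible types can be saved as SequenceFiles.
val pairs = sc.parallelize(Seq(("alice", 29), ("bob", 31)))
pairs.saveAsSequenceFile("/tmp/people_seq")

// Reading back requires stating the key/value types explicitly.
val restored = sc.sequenceFile[String, Int]("/tmp/people_seq")
restored.collect().foreach(println)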


1 Answer

Since Spark 1.4 you can simply use DataFrameWriter and set the format to orc:

peopleSchemaRDD.write.format("orc").save("people")

or

peopleSchemaRDD.write.orc("people")
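
To verify the write, the output can be read back the same way (a minimal sketch; "people" is the path used above, and read.orc assumes Spark 1.5+, while read.format("orc").load(...) works on 1.4):

// Load the ORC output back into a DataFrame and inspect it.
val loaded = hiveContext.read.orc("people")
loaded.printSchema()
loaded.show()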
answered Sep 22 '22 by zero323