In the previous version, we had a saveAsOrcFile() method on RDD. This is now gone! How do I save the data in a DataFrame in ORC file format?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def main(args: Array[String]): Unit = {
  println("Creating Orc File!")
  val sparkConf = new SparkConf().setAppName("orcfile")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val people = sc.textFile("/apps/testdata/people.txt")
  // Build the schema explicitly: "name" is a string, "age" is an integer
  val schemaString = "name age"
  val schema = StructType(schemaString.split(" ").map(fieldName =>
    if (fieldName == "name") StructField(fieldName, StringType, true)
    else StructField(fieldName, IntegerType, true)))
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
  // Apply the schema to the RDD of rows
  val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)
  // Register the DataFrame as a temporary table
  peopleSchemaRDD.registerTempTable("people")
  val results = hiveContext.sql("SELECT * FROM people")
  results.map(t => "Name: " + t.toString).collect().foreach(println)
  // Now I want to save this DataFrame (peopleSchemaRDD) in ORC format. How do I do that?
}
Spark on HDP supports the Optimized Row Columnar ("ORC") file format, a self-describing, type-aware column-based file format that is one of the primary file formats supported in Apache Hive. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query.
Parquet is better at storing nested data. ORC is better at predicate pushdown, supports ACID properties in Hive, and compresses more efficiently.
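The column pruning and predicate pushdown mentioned above can be sketched roughly as follows. This is only an illustration, assuming Spark 1.4 or later with a HiveContext and an ORC dataset that has "name" and "age" columns; the object name, the method name, the path argument and the age threshold are hypothetical. The spark.sql.orc.filterPushdown setting is off by default in older Spark versions, so the sketch turns it on explicitly.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper, for illustration only.
object OrcPushdownSketch {
  def namesOfAdults(hiveContext: HiveContext, path: String): DataFrame = {
    // Ask Spark to push filters down to the ORC reader where it can
    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
    val people = hiveContext.read.format("orc").load(path)
    // Only "age" (for the filter) and "name" (for the output) need to be read
    people.filter(people("age") > 21).select("name")
  }
}
```

Whether the filter is actually pushed down depends on the Spark version and the column types, so it is worth checking the physical plan with explain() on the returned DataFrame.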
The default file format for Spark is Parquet, but there are use cases where other formats are better suited, including SequenceFiles: binary key/value pairs that are a good choice for blob storage when the overhead of rich schema support is not required.
Since Spark 1.4 you can simply use DataFrameWriter and set format to orc:
peopleSchemaRDD.write.format("orc").save("people")
or
peopleSchemaRDD.write.orc("people")
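To check that the write worked, the data can be read back through the same context. The following is a minimal sketch, assuming Spark 1.4 or later and a HiveContext (in Spark 1.x the ORC data source lives in the Hive module); the object name, the method name and the choice of SaveMode.Overwrite are assumptions for illustration, not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper, for illustration only.
object OrcRoundTrip {
  def saveAndReload(hiveContext: HiveContext, df: DataFrame, path: String): DataFrame = {
    // Overwrite is an assumption here; the default mode fails if the path already exists
    df.write.mode(SaveMode.Overwrite).format("orc").save(path)
    // Read the ORC files back as a new DataFrame
    hiveContext.read.format("orc").load(path)
  }
}
```

With the names from the question, this could be called as OrcRoundTrip.saveAndReload(hiveContext, peopleSchemaRDD, "people"), and printSchema() on the result should show the name and age columns.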