I want to save an RDD as a Parquet file. To do this, I convert the RDD to a DataFrame and then use a schema to save the DataFrame as a Parquet file:
val aStruct = new StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("role", StringType, nullable = true)))
val newDF = sqlContext.createDataFrame(filtered, aStruct)
The question is how to create aStruct automatically for all columns, assuming that all of them are StringType. Also, what is the meaning of nullable = true? Does it mean that all empty values will be substituted by null?
Why not use the built-in toDF?
scala> val myRDD = sc.parallelize(Seq(("1", "roleA"), ("2", "roleB"), ("3", "roleC")))
myRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[60] at parallelize at <console>:27
scala> val colNames = List("id", "role")
colNames: List[String] = List(id, role)
scala> val myDF = myRDD.toDF(colNames: _*)
myDF: org.apache.spark.sql.DataFrame = [id: string, role: string]
scala> myDF.show
+---+-----+
| id| role|
+---+-----+
| 1|roleA|
| 2|roleB|
| 3|roleC|
+---+-----+
scala> myDF.printSchema
root
|-- id: string (nullable = true)
|-- role: string (nullable = true)
scala> myDF.write.save("myDF.parquet")
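If you do need an explicit schema anyway (e.g. for createDataFrame, as in your original snippet), you can build it programmatically from the list of column names instead of writing each StructField by hand. A minimal sketch, assuming every column is a StringType:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val colNames = List("id", "role")

// One StructField per column name; all columns string-typed and nullable here.
val schema = StructType(colNames.map(name => StructField(name, StringType, nullable = true)))
```

The resulting schema is equivalent to your hand-written aStruct and scales to any number of columns.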
The nullable = true simply means that the specified column may contain null values. It does not substitute anything: empty values are not replaced, the flag only declares that nulls are permitted. (This is especially relevant for Int columns, which would otherwise not admit a missing value -- Scala's Int has no NA or null.)
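As a small sketch of the distinction (the column names are illustrative): with nullable = false the schema declares that the column must never hold null, whereas nullable = true merely permits it.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// "id" is declared non-nullable; "role" may contain nulls.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("role", StringType, nullable = true)))

// Building a DataFrame against this schema with a null "id" would fail at
// runtime, while a null "role" is allowed.
```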