
How to automate StructType creation for passing RDD to DataFrame

I want to save an RDD as a Parquet file. To do this, I convert the RDD to a DataFrame using an explicit schema and then save the DataFrame as a Parquet file:

    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    val aStruct = new StructType(Array(StructField("id",StringType,nullable = true),
                                       StructField("role",StringType,nullable = true)))
    val newDF = sqlContext.createDataFrame(filtered, aStruct)

How can I create aStruct automatically for all columns, assuming that all of them are StringType? Also, what does nullable = true mean? Does it mean that all empty values will be replaced by null?

duckertito asked Nov 15 '16 15:11


1 Answer

Why not use the built-in toDF?

    scala> val myRDD = sc.parallelize(Seq(("1", "roleA"), ("2", "roleB"), ("3", "roleC")))
    myRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[60] at parallelize at <console>:27

    scala> val colNames = List("id", "role")
    colNames: List[String] = List(id, role)

    scala> val myDF = myRDD.toDF(colNames: _*)
    myDF: org.apache.spark.sql.DataFrame = [id: string, role: string]

    scala> myDF.show
    +---+-----+
    | id| role|
    +---+-----+
    |  1|roleA|
    |  2|roleB|
    |  3|roleC|
    +---+-----+

    scala> myDF.printSchema
    root
     |-- id: string (nullable = true)
     |-- role: string (nullable = true)

    scala> myDF.write.save("myDF.parquet")
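
If you do need an explicit StructType (for example because the RDD holds Row objects rather than tuples), the schema can probably be derived from the list of column names. The following is only a sketch, assuming Spark 2.x with a SparkSession; the helper name stringSchema, the app name and the sample rows are made up for illustration and are not from the question:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: Spark 2.x. In spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("schema-from-names") // hypothetical app name
  .getOrCreate()

// Hypothetical helper: one nullable StringType field per column name.
def stringSchema(colNames: Seq[String]): StructType =
  StructType(colNames.map(name => StructField(name, StringType, nullable = true)))

// Sample data standing in for the question's `filtered` RDD (assumed to be an RDD[Row]).
val rows = spark.sparkContext.parallelize(Seq(
  Row("1", "roleA"),
  Row("2", "roleB"),
  Row("3", "roleC")))

val schema = stringSchema(Seq("id", "role"))
val df = spark.createDataFrame(rows, schema)
df.printSchema()
```

The resulting DataFrame can then be written with df.write.parquet(path), in the same way as myDF above.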

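The nullable = true flag simply means that the specified column is allowed to contain null values. It does not replace empty values with null, so an empty string stays an empty string. The flag is especially useful for Int columns, which would normally not have a null value, since Int has no NA or null.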
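
To see the difference between an empty string and a null in a nullable column, something like the following could be tried. It is a rough sketch, assuming Spark 2.x with a SparkSession; the sample rows and the names emptyCount and nullCount are illustrative only:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: Spark 2.x. In spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nullable-demo") // hypothetical app name
  .getOrCreate()

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("role", StringType, nullable = true)))

// Illustrative rows: one regular value, one empty string, one real null.
val rows = spark.sparkContext.parallelize(Seq(
  Row("1", "roleA"),
  Row("2", ""),
  Row("3", null)))

val df = spark.createDataFrame(rows, schema)

// The empty string is kept as-is; only the explicit null counts as null.
val emptyCount = df.filter(df("role") === "").count()
val nullCount = df.filter(df("role").isNull).count()
println(s"empty strings: $emptyCount, nulls: $nullCount")
```

If empty strings should be treated as missing, that conversion has to be done explicitly before or after building the DataFrame; the schema flag alone does not do it.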

evan.oman answered Sep 27 '22 23:09