DataFrame partitionBy on nested columns

I am trying to call partitionBy on a nested field like below:

val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)

I get the error below when I run it, even though 'name' is listed as a field in the schema that the error prints. Is there a different syntax for specifying a nested column name?

java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))

This is my json file:

{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}

asked Jul 12 '16 03:07

vijay

1 Answer

This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084

I ran into this as well and worked around it by un-nesting the columns in my dataset, so that the partition column becomes a top-level field. My dataset was a little different from yours, but here is the strategy.

Original Json:

{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}

Modified Json:

{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data_type": "EventData",
  "data_dataDetails_name": "EventName",
  "data_dataDetails_id": "1234"
}

Code to get to Modified Json:

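import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

def main(args: Array[String]): Unit = {
  ...

  // Flatten the children of "data" and keep the top-level columns alongside them.
  val data = df.select(children("data", df) ++ Seq($"name", $"time"): _*)

  data.printSchema

  data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}

// Returns one aliased column per direct child of the struct column `colname`,
// e.g. data.type becomes data_type. Only one level is flattened: a child that
// is itself a struct stays a struct.
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}").alias(s"${colname}_${x.name}"))
}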
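
Note that children only un-nests one level. With the JSON from the question, data.dataDetails would come out as a struct column named data_dataDetails, not as data_dataDetails_name. If your data is nested more deeply, a recursive variant along the following lines might help. This is only a sketch, not something from the Spark API: the names FlattenSketch, leafPaths and flatten, and the output path, are made up for illustration, and it assumes Spark 2.2 or later (SparkSession and spark.read.json on a Dataset[String]) and field names that contain no dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper object; none of these names come from Spark itself.
object FlattenSketch {

  // Collect the dotted path of every non-struct (leaf) field,
  // e.g. "data.dataDetails.name". Arrays and maps are treated as leaves.
  def leafPaths(schema: StructType, prefix: Option[String] = None): Seq[String] =
    schema.fields.toSeq.flatMap { field =>
      val path = prefix.map(p => s"$p.${field.name}").getOrElse(field.name)
      field.dataType match {
        case nested: StructType => leafPaths(nested, Some(path))
        case _                  => Seq(path)
      }
    }

  // Select every leaf as a top-level column, replacing dots with underscores.
  def flatten(df: DataFrame): DataFrame =
    df.select(leafPaths(df.schema).map(p => col(p).alias(p.replace(".", "_"))): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    // The record from the question, inlined so the sketch runs on its own.
    val json = Seq(
      """{"name":"AssetName","time":"2016-06-20T11:57:19.4941368-04:00",
        |"data":{"type":"EventData","dataDetails":{"name":"EventName","id":"1234"}}}"""
        .stripMargin.replace("\n", "")
    )
    val df = spark.read.json(json.toDS())

    val flat = flatten(df)
    flat.printSchema()

    // The partition column is now a top-level field, so partitionBy can find it.
    // The output path is an assumption; point it wherever suits you.
    flat.write
      .mode("overwrite")
      .partitionBy("data_dataDetails_name")
      .parquet("/tmp/flatten-sketch-output")

    spark.stop()
  }
}
```

If you only need the one partition column, it may be simpler to add just that field with withColumn("data_dataDetails_name", col("data.dataDetails.name")) and leave the rest of the structure nested.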

answered Sep 28 '22 18:09

satoukum

Tags: apache-spark, apache-spark-sql, spark-dataframe
