I have a dataset in the following way:
FieldA FieldB ArrayField
1      A      {1,2,3}
2      B      {3,5}
I would like to explode the data on ArrayField so the output will look in the following way:
FieldA FieldB ExplodedField
1      A      1
1      A      2
1      A      3
2      B      3
2      B      5
That is, I want to generate one output row for each item in the array in ArrayField, while keeping the values of the other fields. How would you implement this in Spark? Note that the input dataset is very large.
Spark SQL's explode function splits an array or map column into rows. Spark defines several flavors of this function: explode_outer, which also emits a row for null or empty arrays; posexplode, which additionally returns each element's position; and posexplode_outer, which combines both behaviors.
If you instead want to flatten nested arrays, use the flatten function, which converts an array-of-arrays column into a single array column.
explode(col): Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array, and key and value for elements in the map, unless specified otherwise.
The explode function should get that done.
pyspark version:
>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])], ["col1", "col2", "col3"])
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+
Scala version
scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+