 

Explode array data into rows in spark [duplicate]

I have a dataset in the following way:

FieldA    FieldB    ArrayField
1         A         {1,2,3}
2         B         {3,5}

I would like to explode the data on ArrayField so the output will look in the following way:

FieldA    FieldB    ExplodedField
1         A         1
1         A         2
1         A         3
2         B         3
2         B         5

That is, I want to generate an output row for each item in the array in ArrayField while keeping the values of the other fields.

How would you implement this in Spark? Note that the input dataset is very large.

Gluz asked Jun 08 '17 13:06

People also ask

How do you explode an array in spark?

Spark SQL's explode function is used to split an array or map DataFrame column into rows. Spark defines several flavors of this function: explode_outer handles nulls and empty arrays, posexplode explodes with the position of each element, and posexplode_outer combines both behaviors.

How do you flatten an array in PySpark?

If you want to flatten the arrays, use flatten function which converts array of array columns to a single array on DataFrame.

What does explode function do in spark?

explode(col) returns a new row for each element in the given array or map. It uses the default column name col for elements in an array, and key and value for elements in a map, unless specified otherwise.


1 Answer

The explode function should get that done.

pyspark version:

>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])], ["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+

Scala version:

scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode

scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+
rogue-one answered Sep 22 '22 00:09