
How to flatmap a nested Dataframe in Spark

I have a nested string like the one shown below. I want to flatMap it to produce unique rows in Spark.

My dataframe has

A,B,"x,y,z",D

I want to convert it to produce output like

A,B,x,D
A,B,y,D
A,B,z,D

How can I do that?

Basically, how can I do a flatMap and apply any function inside the DataFrame?

Thanks

asked Apr 22 '16 by user2230605

People also ask

How do I flatten nested array in spark?

In Spark SQL, flatten is a built-in function that converts an array-of-arrays column (a nested array, i.e. ArrayType(ArrayType(StringType))) into a single flat array column on the Spark DataFrame. Spark SQL is the Spark module for structured data processing.

Can we apply flatMap on DataFrame?

flatMap() on a Spark DataFrame operates like it does on an RDD: the supplied function is applied to every element, and because each element may be split into several outputs or dropped entirely, the result count of flatMap() can differ from the input count.

Is flatMap a transformation in spark?

A flatMap is a transformation operation. It is applied to each element of an RDD and returns the result as a new RDD. It is similar to map, but flatMap allows returning 0, 1, or more elements from the mapping function.
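The 0-1-or-more semantics are easiest to see on plain Scala collections, which RDD.flatMap mirrors (a minimal sketch, not Spark-specific):

```scala
val nums = List(1, 2, 3)

// map emits exactly one output per input, so the nesting is preserved
val mapped = nums.map(n => List.fill(n)(n))
// mapped == List(List(1), List(2, 2), List(3, 3, 3))

// flatMap can emit 0, 1, or more outputs per input and concatenates them
val flat = nums.flatMap(n => List.fill(n)(n))
// flat == List(1, 2, 2, 3, 3, 3)
```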


1 Answer

Spark 2.0+

Dataset.flatMap:

// the SparkSession implicits must be in scope for .as and .toDF
import spark.implicits._

val ds = df.as[(String, String, String, String)]
ds.flatMap { 
  case (x1, x2, x3, x4) => x3.split(",").map((x1, x2, _, x4))
}.toDF
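The per-row logic inside that flatMap is plain Scala, so it can be sketched outside Spark: split the third field and rebuild one tuple per piece.

```scala
// One input row expands into one output row per comma-separated value in x3
val row = ("A", "B", "x,y,z", "D")
val expanded = row match {
  case (x1, x2, x3, x4) => x3.split(",").toList.map(v => (x1, x2, v, x4))
}
// expanded == List(("A","B","x","D"), ("A","B","y","D"), ("A","B","z","D"))
```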

Spark 1.3+

Use split and explode functions:

import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._  // for $ and toDF

val df = Seq(("A", "B", "x,y,z", "D")).toDF("x1", "x2", "x3", "x4")
df.withColumn("x3", explode(split($"x3", ",")))
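One caveat when comparing the two approaches: plain Scala/Java String.split drops trailing empty strings by default, whereas Spark SQL's split function keeps them (it splits with limit -1, as far as I can tell), so the outputs can differ on input like "x,y,,":

```scala
// Default Java/Scala split (limit 0) discards trailing empty strings
val defaultSplit = "x,y,,".split(",").toList
// defaultSplit == List("x", "y")

// limit = -1 keeps trailing empties, matching Spark SQL's split behavior
val keepEmpties = "x,y,,".split(",", -1).toList
// keepEmpties == List("x", "y", "", "")
```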

Spark 1.x

DataFrame.explode (deprecated in Spark 2.x)

df.explode($"x3")(_.getAs[String](0).split(",").map(Tuple1(_)))
answered Sep 21 '22 by zero323