
Flattening Rows in Spark

I am doing some testing with Spark using Scala. We usually read JSON files that need to be manipulated, as in the following example:

test.json:

{"a":1,"b":[2,3]} 
val test = sqlContext.read.json("test.json") 

How can I convert it to the following format:

{"a":1,"b":2}
{"a":1,"b":3}
Nir Ben Yaacov asked Oct 02 '15 11:10


People also ask

What is flattening in Spark?

Flatten – Creates a single array from an array of arrays (nested array). If a structure of nested arrays is deeper than two levels then only one level of nesting is removed.

How do I flatten an array column in PySpark?

To flatten arrays, use the flatten function, which converts a DataFrame column holding an array of arrays into a single array.
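To make the difference from explode concrete, here is a minimal sketch of flatten, assuming Spark 2.4 or later (where `org.apache.spark.sql.functions.flatten` is available) and a local `SparkSession`. The object name `FlattenSketch`, the helper `flattenB`, and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.flatten

object FlattenSketch {
  // Hypothetical helper: removes one level of nesting from column "b"
  // (array<array<bigint>> becomes array<bigint>). The row count is unchanged.
  def flattenB(df: DataFrame): DataFrame =
    df.withColumn("b", flatten(df("b")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: one row whose "b" column is an array of arrays.
    val nested = Seq((1L, Seq(Seq(2L, 3L), Seq(4L)))).toDF("a", "b")

    flattenB(nested).show()
    spark.stop()
  }
}
```

Note that flatten keeps one row per input row and only merges the inner arrays; producing one row per element is what explode does.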

What is explode in Spark?

explode(col) returns a new row for each element in the given array or map. It uses the default column name col for elements in the array, and key and value for elements in the map, unless specified otherwise.

How do you explode an array in Spark?

Spark SQL's explode function splits an array or map column of a DataFrame into rows. Spark defines several variants of this function: explode_outer, which also keeps rows whose array or map is null or empty; posexplode, which also returns the position of each element; and posexplode_outer, which combines both behaviors.
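The variants can be compared side by side. This is a rough sketch, assuming Spark 2.2 or later (where `explode_outer` exists) and a local `SparkSession`; the object name `ExplodeVariants`, its helper methods, and the sample rows are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, explode_outer, posexplode}

object ExplodeVariants {
  // explode drops rows whose array is null or empty.
  def exploded(df: DataFrame): DataFrame =
    df.withColumn("b", explode(col("b")))

  // explode_outer keeps those rows and sets "b" to null.
  def explodedOuter(df: DataFrame): DataFrame =
    df.withColumn("b", explode_outer(col("b")))

  // posexplode produces two columns (default names: pos and col),
  // so it is used in a select rather than in withColumn.
  def withPosition(df: DataFrame): DataFrame =
    df.select(col("a"), posexplode(col("b")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-variants")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: the second row has an empty array.
    val df = Seq((1L, Seq(2L, 3L)), (4L, Seq.empty[Long])).toDF("a", "b")

    exploded(df).show()
    explodedOuter(df).show()
    withPosition(df).show()
    spark.stop()
  }
}
```

Which variant fits depends on whether rows with empty or null arrays should survive the transformation.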


1 Answer

You can use the explode function:

scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode

scala> val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
test: org.apache.spark.sql.DataFrame = [a: bigint, b: array<bigint>]

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

scala> val flattened = test.withColumn("b", explode($"b"))
flattened: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]

scala> flattened.printSchema
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)

scala> flattened.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  3|
+---+---+
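The session above uses the Spark 1.x `sqlContext`. On Spark 2.2 or later the same approach could look roughly like the sketch below, which assumes a local `SparkSession` and supplies the record inline instead of reading test.json. The object name `FlattenRows` and the helper `flattenB` are made up for illustration; `toJSON` is used to render each flattened row back into the JSON form the question asks for.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenRows {
  // Hypothetical helper: replaces the array column "b" with one row per element.
  def flattenB(df: DataFrame): DataFrame =
    df.withColumn("b", explode(col("b")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-rows")
      .getOrCreate()
    import spark.implicits._

    // Same record as test.json, supplied inline so the sketch needs no file.
    // With a real file this would be spark.read.json("test.json").
    val test = spark.read.json(Seq("""{"a":1,"b":[2,3]}""").toDS())

    val flattened = flattenB(test)

    // toJSON turns each row into a JSON string, one per flattened record.
    flattened.toJSON.collect().foreach(println)
    spark.stop()
  }
}
```

To write the result to disk instead of printing it, `flattened.write.json(...)` with an output directory of your choosing produces one JSON object per line.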
zero323 answered Sep 25 '22 12:09