I am doing some testing for Spark using Scala. We usually read JSON files that need to be manipulated, as in the following example:
test.json:
{"a":1,"b":[2,3]}
val test = sqlContext.read.json("test.json")
How can I convert it to the following format:
{"a":1,"b":2} {"a":1,"b":3}
flatten – creates a single array from an array of arrays (a nested array). If the structure of nested arrays is deeper than two levels, only one level of nesting is removed. If you want to flatten such arrays, use the flatten function, which converts an array-of-arrays column into a single-array column on a DataFrame.
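A minimal sketch of flatten (available from Spark 2.4 onward; this assumes a SparkSession named spark and a made-up input file, neither of which is in the original question):

import org.apache.spark.sql.functions.flatten
import spark.implicits._

// Hypothetical input where column "b" is an array of arrays.
val nested = spark.read.json(Seq("""{"a":1,"b":[[2,3],[4]]}""").toDS)
nested.withColumn("b", flatten($"b")).show()
// +---+---------+
// |  a|        b|
// +---+---------+
// |  1|[2, 3, 4]|
// +---+---------+

Note that flatten collapses nesting but still keeps one row per record; it is explode, described next, that turns array elements into separate rows, which is what the question asks for.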
explode(col) – returns a new row for each element in the given array or map. Uses the default column name col for elements in the array, and key and value for elements in the map, unless specified otherwise. The Spark SQL explode function splits array or map DataFrame columns into rows. Spark defines several flavors of this function: explode_outer, which handles null and empty arrays; posexplode, which also returns the position of each element; and posexplode_outer, which combines both behaviors.
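A hedged sketch of two of these variants (assuming Spark 2.2+ and a SparkSession named spark; the sample rows are hypothetical):

import org.apache.spark.sql.functions.{explode_outer, posexplode}
import spark.implicits._

val df = spark.read.json(Seq("""{"a":1,"b":[2,3]}""", """{"a":4,"b":null}""").toDS)

// explode_outer keeps the a=4 row even though its array is null.
df.select($"a", explode_outer($"b").as("b")).show()
// +---+----+
// |  a|   b|
// +---+----+
// |  1|   2|
// |  1|   3|
// |  4|null|
// +---+----+

// posexplode also returns each element's position within the array
// (the a=4 row is dropped because its array is null).
df.select($"a", posexplode($"b")).show()
// +---+---+---+
// |  a|pos|col|
// +---+---+---+
// |  1|  0|  2|
// |  1|  1|  3|
// +---+---+---+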
You can use the explode function:
scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode

scala> val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
test: org.apache.spark.sql.DataFrame = [a: bigint, b: array<bigint>]

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

scala> val flattened = test.withColumn("b", explode($"b"))
flattened: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]

scala> flattened.printSchema
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)

scala> flattened.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  3|
+---+---+