 

Apache Spark: Convert column with a JSON String to new Dataframe in Scala spark [duplicate]

I have a DataFrame with a column of string type. Each value in this column is a JSON string, and I want to convert this column into multiple columns based on that JSON. I could do it if I had the JSON schema, but I don't have it.

Example:

Original DataFrame:

---------------------
|        json_string|
---------------------
|{"a":2,"b":"hello"}|
|   {"a":1,"b":"hi"}|
---------------------

After conversion/parsing:

--------------
|  a |     b |
--------------
|  2 |  hello|
|  1 |     hi|
--------------

I am using Apache Spark 2.1.1.
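For reproducibility, the example DataFrame above can be built with something like the following (a minimal sketch, assuming a spark-shell session where spark and its implicits are available; json_string is just the column name from the example):

import spark.implicits._

// build a DataFrame with a single column of JSON strings
val originalDF = Seq(
  """{"a":2,"b":"hello"}""",
  """{"a":1,"b":"hi"}"""
).toDF("json_string")

originalDF.show(false)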

asked Dec 05 '22 by Clairton Menezes


1 Answer

If you do not have a predefined schema, another option is to convert the column to an RDD[String] or Dataset[String] and load it as JSON, letting Spark infer the schema.

Here is how you can do it:

// convert the JSON string column to RDD[String]
val rdd = originalDF.rdd.map(_.getString(0))

// or to Dataset[String] (requires import spark.implicits._ to be in scope)
val ds = rdd.toDS

Now load it as JSON:

val df = spark.read.json(rdd) // or spark.read.json(ds)

df.show(false)

Also, prefer json(ds); json(rdd) is deprecated as of Spark 2.2.0:

@deprecated("Use json(Dataset[String]) instead.", "2.2.0")

Output:

+---+-----+
|a  |b    |
+---+-----+
|2  |hello|
|1  |hi   |
+---+-----+
answered Dec 29 '22 by koiralo