I have a DataFrame with a string column whose values are JSON. I want to convert this column into multiple columns based on the JSON fields. I can do this if I have the JSON schema, but I don't have it.
Example:
Original Dataframe:
---------------------
| json_string|
---------------------
|{"a":2,"b":"hello"}|
| {"a":1,"b":"hi"}|
---------------------
After conversion/parsing:
--------------
| a | b |
--------------
| 2 | hello|
| 1 | hi|
--------------
I am using Apache Spark 2.1.1.
If you do not have a predefined schema, the other option is to convert the column to an RDD[String] or Dataset[String] and load it as JSON, letting Spark infer the schema.
Here is how you can do it:
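// needed for .toDS outside spark-shell
import spark.implicits._
// convert the single string column to RDD[String]
val rdd = originalDF.rdd.map(_.getString(0))
// optionally convert to Dataset[String]
val ds = rdd.toDS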
Now load it as JSON:
val df = spark.read.json(rdd) // or spark.read.json(ds)
df.show(false)
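On Spark 2.2.0 and later, prefer json(ds), because json(rdd) is deprecated as of 2.2.0. On 2.1.1, the version in the question, only the RDD overload is available.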
@deprecated("Use json(Dataset[String]) instead.", "2.2.0")
Output:
+---+-----+
|a |b |
+---+-----+
|2 |hello|
|1 |hi |
+---+-----+
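If the DataFrame has other columns you want to keep, one possible variation is to use the JSON read only to infer the schema, then parse the column in place with from_json. The following is a minimal sketch, not a definitive implementation: the object name JsonColumnSketch, the helper expandJsonColumn, the extra "id" column, and the local SparkSession settings are all assumptions made for illustration. It also assumes the JSON column contains no nulls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

object JsonColumnSketch {

  // Hypothetical helper: replaces the string column `jsonCol` with one
  // column per top-level JSON field, keeping all other columns of `df`.
  def expandJsonColumn(spark: SparkSession, df: DataFrame, jsonCol: String): DataFrame = {
    // Read the JSON strings once so Spark can infer a schema.
    // json(RDD[String]) works on 2.1.x; it is deprecated from 2.2.0 on.
    val jsonRdd = df.select(jsonCol).rdd.map(_.getString(0))
    val inferredSchema = spark.read.json(jsonRdd).schema

    // Columns to keep, followed by the flattened fields of the parsed struct.
    val keptCols = df.columns.filter(_ != jsonCol).map(c => col(c))
    val outputCols = keptCols :+ col("parsed.*")

    df.withColumn("parsed", from_json(col(jsonCol), inferredSchema))
      .select(outputCols: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("json-column-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question, plus an assumed "id" column.
    val originalDF = Seq(
      (1, """{"a":2,"b":"hello"}"""),
      (2, """{"a":1,"b":"hi"}""")
    ).toDF("id", "json_string")

    expandJsonColumn(spark, originalDF, "json_string").show(false)
    spark.stop()
  }
}
```

Note that this scans the data twice, once to infer the schema and once to parse, so on large inputs it may be worth inferring the schema from a sample instead.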