Spark Row to JSON

I would like to create a JSON from a Spark 1.6 DataFrame (using Scala). I know that there is the simple solution of doing df.toJSON.

However, my problem looks a bit different. Consider, for instance, a DataFrame with the following columns:

|  A  |     B     |  C1  |  C2  |    C3   |
-------------------------------------------
|  1  | test      |  ab  |  22  |  TRUE   |
|  2  | mytest    |  gh  |  17  |  FALSE  |
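
For comparison, df.toJSON would simply turn each whole row into one JSON string (illustrative output for the rows above):

df.toJSON.take(2)
// Array({"A":1,"B":"test","C1":"ab","C2":22,"C3":true},
//       {"A":2,"B":"mytest","C1":"gh","C2":17,"C3":false})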

Instead, I would like to end up with a DataFrame like

|  A  |     B     |                        C                    |
-----------------------------------------------------------------
|  1  | test      | { "c1" : "ab", "c2" : 22, "c3" : true }     |
|  2  | mytest    | { "c1" : "gh", "c2" : 17, "c3" : false }    |

where C is a JSON containing C1, C2, and C3. Unfortunately, at compile time I do not know what the DataFrame looks like (except for the columns A and B, which are always fixed).

As for why I need this: I am using Protobuf to send around the results. Unfortunately, my DataFrame sometimes has more columns than expected, and I would still like to send those via Protobuf, but I do not want to specify all columns in the Protobuf definition.

How can I achieve this?

asked Mar 22 '16 by navige


2 Answers

Spark 2.1 should have native support for this use case (see #15354).

import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._ // enables the $"colName" syntax

df.select($"A", $"B", to_json(struct($"C1", $"C2", $"C3")).alias("C"))
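
Since the question states that the non-fixed columns are unknown at compile time, the struct can also be built from whatever columns are present at runtime. A minimal sketch, assuming Spark 2.1+ and that only A and B are fixed:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack every column except the fixed A and B into a single JSON column C
val fixed = Set("A", "B")
val dynamic = df.columns.filterNot(fixed.contains).map(col(_))
df.select(col("A"), col("B"), to_json(struct(dynamic: _*)).alias("C"))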
answered Sep 18 '22 by Michael Armbrust


I use this (in PySpark) to solve the same problem with to_json:

from pyspark.sql.functions import col, struct, to_json

output_df = df.select(to_json(struct(col("*"))).alias("content"))
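
Note that this packs every column, including A and B, into the JSON. To keep A and B as separate columns, as in the question, select them explicitly alongside the to_json call, as in the accepted answer above.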
answered Sep 20 '22 by Cyanny