
DataFrame to Json Array in Spark

I am writing a Spark application in Java that reads a Hive table and stores the output in HDFS in JSON format.

I read the Hive table using HiveContext, which returns a DataFrame. Below is the code snippet.

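import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf conf = new SparkConf().setAppName("App");
JavaSparkContext sc = new JavaSparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
DataFrame data1 = hiveContext.sql("select * from tableName");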

Now I want to convert the DataFrame to a JSON array. For example, data1 looks like this:

|  A  |     B     |
-------------------
|  1  | test      |
|  2  | mytest    |

I need output like this:

[{1:"test"},{2:"mytest"}]

I tried data1.schema.json(), and it gives me output like the following, not an array:

{1:"test"}
{2:"mytest"}

What is the right approach or function to convert the DataFrame to a JSON array without using any third-party libraries?

user2731629 asked Jan 04 '23 17:01


1 Answer

data1.schema.json returns a JSON string describing the schema of the DataFrame, not the data itself. You will get:

String = {"type":"struct",
          "fields":
                  [{"name":"A","type":"integer","nullable":false,"metadata":{}},
                  {"name":"B","type":"string","nullable":true,"metadata":{}}]}

To convert your DataFrame to an array of JSON, use the toJSON method of DataFrame. It produces one JSON string per row, which you can then collect and join into a single array string:

val df = sc.parallelize(Array( (1, "test"), (2, "mytest") )).toDF("A", "B")
df.show()

+---+------+
|  A|     B|
+---+------+
|  1|  test|
|  2|mytest|
+---+------+

df.toJSON.collect.mkString("[", ",", "]")
String = [{"A":1,"B":"test"},{"A":2,"B":"mytest"}]
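
Since the question is in Java, here is a rough Java sketch of the same join step. The class name JsonArrayJoin and the helper toJsonArray are hypothetical names chosen for illustration, and the hard-coded list is a stand-in for the rows collected from Spark. In Spark 1.x the list would likely come from data1.toJSON().toJavaRDD().collect(), and in Spark 2.x or later from data1.toJSON().collectAsList(). The sketch uses only the Java standard library (String.join needs Java 8).

```java
import java.util.Arrays;
import java.util.List;

class JsonArrayJoin {
    // Hypothetical helper: wraps already-serialized JSON objects in brackets,
    // separated by commas. It does not validate or re-encode the input strings.
    static String toJsonArray(List<String> jsonRows) {
        return "[" + String.join(",", jsonRows) + "]";
    }

    public static void main(String[] args) {
        // Stand-in for the collected rows (assumption: one JSON string per row,
        // as produced by toJSON on the DataFrame).
        List<String> jsonRows = Arrays.asList(
            "{\"A\":1,\"B\":\"test\"}",
            "{\"A\":2,\"B\":\"mytest\"}");

        // Prints [{"A":1,"B":"test"},{"A":2,"B":"mytest"}]
        System.out.println(toJsonArray(jsonRows));
    }
}
```

Note that collecting brings every row to the driver, so this approach is probably only suitable when the result fits comfortably in driver memory.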
philantrovert answered Jan 07 '23 10:01