 

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my pyspark sql dataframe to json and then save as a file.

df_final = df_final.union(join_df)

df_final contains the value as such:

[image: sample rows of df_final, e.g. Variable=Col1, Min=20, Max=30; Variable=Col2, Min=25, Max=40]

I tried something like this, but it created invalid JSON.

df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25","Max":"40"}

My expected file should have data as below:

[
{"Variable":"Col1",
"Min":"20",
"Max":"30"},
{"Variable":"Col2",
"Min":"25",
"Max":"40"}]
Shankar Panda asked Nov 22 '18


People also ask

How do I convert to JSON in PySpark?

The to_json() function in PySpark converts a column of MapType or StructType to a JSON string. The json_tuple() function extracts fields from a JSON string and returns them as new columns.

How do I export a DataFrame in PySpark?

Use the write property of a PySpark DataFrame to get a DataFrameWriter, which can export the DataFrame to formats such as CSV or JSON. It writes the DataFrame to the file path you specify on disk; by default the CSV writer does not write a header row of column names.


3 Answers

In PySpark you can store your dataframe into a JSON file directly; there is no need to convert the dataframe to JSON first.

df_final.coalesce(1).write.format('json').save('/path/file_name.json')

If you still want to convert your dataframe to JSON strings, you can use df_final.toJSON().
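Combining the two ideas, toJSON() produces each row as a separate JSON string, which can then be parsed and dumped as one array (a minimal sketch; since no Spark session is assumed here, the literal strings below stand in for the result of df_final.toJSON().collect()):

```python
import json

# Stand-ins for df_final.toJSON().collect(): each element is one row
# of the dataframe serialized as a JSON string.
json_rows = [
    '{"Variable":"Col1","Min":"20","Max":"30"}',
    '{"Variable":"Col2","Min":"25","Max":"40"}',
]

# Parse each row, then dump the whole list as a single JSON array,
# which matches the expected output in the question.
records = [json.loads(r) for r in json_rows]
with open("createjson.json", "w") as outfile:
    json.dump(records, outfile)
```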

Sahil Desai answered Sep 27 '22


A solution can be to collect the rows and then write them out with json.dump:

import json

collected_rows = df_final.collect()
# collect() returns Row objects; convert each one to a plain dict
# so json.dump can serialize the list as a single JSON array.
data = [row.asDict() for row in collected_rows]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
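An alternative along the same lines is to let pandas emit the array directly via to_json(orient="records") (a sketch, assuming the data fits in driver memory; the literal DataFrame below stands in for the result of df_final.toPandas()):

```python
import pandas as pd

# Stand-in for df_final.toPandas(): a pandas DataFrame with the same columns.
pdf = pd.DataFrame([
    {"Variable": "Col1", "Min": "20", "Max": "30"},
    {"Variable": "Col2", "Min": "25", "Max": "40"},
])

# orient="records" serializes the frame as one JSON array of row objects.
json_text = pdf.to_json(orient="records")
```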
OmG answered Sep 27 '22


If you want to use Spark to process the result as JSON files, then the output schema you already have in HDFS is right.

And I assume you ran into trouble reading that data from a normal Python script using:

import json

with open('data.json') as f:
  data = json.load(f)

You should instead read the data line by line:

data = []
with open("data.json",'r') as datafile:
  for line in datafile:
    data.append(json.loads(line))

and then you can use pandas to create a dataframe:

import pandas as pd

df = pd.DataFrame(data)
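As a shortcut, pandas can also read JSON Lines directly with read_json(lines=True), replacing the manual loop (a minimal sketch; the small file written below stands in for Spark's actual output):

```python
import pandas as pd

# Write a small JSON Lines file shaped like Spark's json writer output:
# one JSON object per line.
with open("data.json", "w") as f:
    f.write('{"Variable":"Col1","Min":"20","Max":"30"}\n')
    f.write('{"Variable":"Col2","Min":"25","Max":"40"}\n')

# lines=True tells pandas that each line is a separate JSON record.
df = pd.read_json("data.json", lines=True)
```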
chilun answered Sep 27 '22