
Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my PySpark SQL DataFrame to JSON and then save it as a file.

df_final = df_final.union(join_df)

df_final contains the value as such:

[image: contents of df_final]

I tried something like this, but it created an invalid JSON file.

df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)


My expected file should have data as below:

Shankar Panda asked Nov 22 '18 08:11


3 Answers

In PySpark you can write a DataFrame directly to a JSON file; there is no need to convert the DataFrame to JSON first.
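One thing to keep in mind when reading such output back: the path given to Spark's writer becomes a directory holding one or more part-*.json files, each with one JSON object per line. A minimal sketch of a reader for that layout follows. The helper name read_spark_json_dir is hypothetical, and the sketch assumes uncompressed output on a local filesystem.

```python
import glob
import json
import os


def read_spark_json_dir(output_dir):
    """Collect every record from the part files in a Spark JSON output directory."""
    records = []
    # Sorted so the part files are read in a stable order.
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*.json"))):
        with open(part) as f:
            for line in f:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records


# Hypothetical usage, assuming the path from the question:
# records = read_spark_json_dir(data_output_file + "createjson.json")
```

Other files in the directory, such as the _SUCCESS marker, are ignored because they do not match the part-*.json pattern.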


If you still want to convert the DataFrame to JSON, you can use df_final.toJSON(), which returns an RDD of JSON strings, one per row.
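Since the result is one JSON string per row, the strings still have to be combined to get a single valid JSON document. A minimal sketch, assuming the strings have already been brought to the driver (for example with df_final.toJSON().collect()); the helper name and the sample rows are made up for illustration.

```python
import json


def json_lines_to_array(json_strings):
    """Parse one JSON object per string and return them as a list."""
    return [json.loads(s) for s in json_strings if s.strip()]


# Stand-in for df_final.toJSON().collect(); the field names are invented.
rows = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']

# Serialize the whole list as a single JSON array document.
document = json.dumps(json_lines_to_array(rows))
print(document)
```

This only suits results small enough to hold in driver memory.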

Sahil Desai answered Sep 27 '22 20:09

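One solution is to collect the rows to the driver and write them with json.dump. This only suits DataFrames small enough to fit in driver memory:

import json
# Convert each Row to a dict so the column names are kept as JSON keys
data = [row.asDict() for row in df_final.collect()]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)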
OmG answered Sep 27 '22 20:09


If you want Spark to write the result as JSON files, I think your output in HDFS is already correct: Spark writes one JSON object per line rather than a single JSON array.

I assume the issue you ran into is that a plain Python script cannot read the file with:

with open('data.json') as f:
  data = json.load(f)
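A small sketch of why that call fails, reproduced without Spark. The sample text and field names are invented; it only assumes the one-object-per-line layout described above.

```python
import json

# Two objects, one per line, in the layout Spark's JSON writer produces.
# The field names are invented for this example.
text = '{"id": 1}\n{"id": 2}\n'

try:
    json.loads(text)
    parsed_whole = True
except json.JSONDecodeError:
    # The parser finds a second object after the first one ends.
    parsed_whole = False

# Parsing each line separately works.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```

The file as a whole is not one JSON document, but every individual line is.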

You should read the data line by line instead:

import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        # Each line is a complete JSON object
        data.append(json.loads(line))

You can then use pandas to create a DataFrame:

import pandas as pd
df = pd.DataFrame(data)
chilun answered Sep 27 '22 19:09
