I have a DataFrame that I am trying to save as a JSON file using PySpark 1.4, but it doesn't seem to be working. When I give it the path to the directory, it returns an error stating the directory already exists. My assumption, based on the documentation, was that it would save a JSON file at the path you give it.
df.write.json("C:\Users\username")
Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc. It does, however, create a directory named test containing several sub-directories with blank CRC files.
df.write.json("C:\Users\username\test")
And adding a .JSON file extension produces the same error:
df.write.json("C:\Users\username\test.JSON")
Pandas' to_json() is a built-in DataFrame method that converts the object to a JSON string; when you pass it a path, it exports the DataFrame to a JSON file instead.
Could you not just use
df.toJSON()
as shown here? If not, first convert it to a pandas DataFrame and then write that to JSON:
pandas_df = df.toPandas()
pandas_df.to_json("C:\\Users\\username\\test.JSON")
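Note that df.toJSON() does not write anything to disk by itself; it returns an RDD of JSON strings, one per row. To persist that directly you still go through a Spark save action, which writes part files into a directory much like write.json does. A minimal sketch (the output path is a placeholder):

# df.toJSON() yields an RDD where each element is one row serialized as a JSON string
json_rdd = df.toJSON()
# saveAsTextFile writes part files into the given directory, which must not already exist
json_rdd.saveAsTextFile("C:\\Users\\username\\json_output")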
When working with large data, converting a PySpark DataFrame to pandas is not advisable. You can use the command below to save a JSON file in an output directory, where df is a pyspark.sql.dataframe.DataFrame. The cluster will generate a part file inside the output directory.
df.coalesce(1).write.format('json').save('/your_path/output_directory')
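If you need a single file with a fixed name rather than a Spark-generated part file, a common follow-up is to locate the part file afterwards and copy it out. A minimal sketch using the standard library (the paths are placeholders):

import glob
import shutil

# coalesce(1) leaves exactly one part file inside the output directory;
# its exact name is generated by Spark, so glob for it.
part_file = glob.glob('/your_path/output_directory/part-*')[0]
shutil.copyfile(part_file, '/your_path/output.json')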
I would avoid using write.json since it's causing problems on Windows. Writing the file with Python directly should skip creating the temp directories that are giving you issues.
with open("C:\\Users\\username\\test.json", "w+") as output_file:
output_file.write(df.toJSON())
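If the DataFrame is too large to collect comfortably on the driver, and your PySpark version provides RDD.toLocalIterator(), you can stream the rows to the file one at a time instead; a sketch under that assumption:

# Stream one serialized row at a time instead of materializing
# the whole dataset in driver memory at once.
with open("C:\\Users\\username\\test.json", "w") as output_file:
    for row in df.toJSON().toLocalIterator():
        output_file.write(row + "\n")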