
saving a dataframe to JSON file on local drive in pyspark

I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory, it returns an error stating that the path already exists. My assumption, based on the documentation, was that it would save a JSON file in the path that you give it.

df.write.json("C:\Users\username")

Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test, which contains several sub-directories with blank crc files.

df.write.json("C:\Users\username\test")

Adding a .JSON file extension produces the same error:

df.write.json("C:\Users\username\test.JSON")
Jared asked Jun 26 '15 15:06




3 Answers

Could you not just use

df.toJSON()

as shown here? If not, first convert it to a pandas DataFrame and then write that to JSON.

pandas_df = df.toPandas()
# Raw string, so that backslash sequences such as "\t" in "\test" are not treated as escapes
pandas_df.to_json(r"C:\Users\username\test.JSON")
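
A note on the Windows path used here: in an ordinary Python string literal, backslash sequences such as \t are escapes, so a path like this is safer written as a raw string or built from parts. A minimal sketch, reusing the placeholder directory from the question (PureWindowsPath is chosen here only so the snippet behaves the same on any operating system):

```python
from pathlib import PureWindowsPath

# In an ordinary string literal "\t" is a tab character, so "\test" is NOT
# a backslash followed by "test".
broken = "C:\test"
assert "\t" in broken

# A raw string keeps every backslash literally.
raw = r"C:\Users\username\test.JSON"

# Building the path from parts avoids hand-written separators altogether.
built = PureWindowsPath("C:/Users/username") / "test.JSON"

print(str(built) == raw)
```

Either form can be passed to to_json; wrap the pathlib object in str() if the installed pandas version does not accept path objects.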
Wesley Bowman answered Oct 15 '22 03:10


When working with large data, converting a pyspark dataframe to pandas is not advisable. You can use the command below to save the JSON output in an output directory. Here df is a pyspark.sql.dataframe.DataFrame. Because of coalesce(1), the cluster will generate a single part file inside the output directory.

df.coalesce(1).write.format('json').save('/your_path/output_directory')
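
Spark writes a directory rather than a single file: with coalesce(1), the output directory typically holds one part-* file plus marker files such as _SUCCESS. If a single named file is wanted afterwards, one option is to move the part file out with plain Python. This is only a sketch: the helper name promote_part_file and the paths are hypothetical, and it assumes the output directory lives on the local filesystem.

```python
import glob
import os
import shutil


def promote_part_file(output_dir, target_path):
    """Move the single part-* file from a Spark output directory to target_path.

    Hypothetical helper: assumes output_dir is on the local filesystem and
    was written with coalesce(1), so exactly one part file is expected.
    """
    # "part-*" does not match marker files (_SUCCESS) or hidden .crc files.
    part_files = glob.glob(os.path.join(output_dir, "part-*"))
    if len(part_files) != 1:
        raise ValueError(
            "expected exactly one part file in %r, found %d"
            % (output_dir, len(part_files))
        )
    shutil.move(part_files[0], target_path)
    return target_path
```

It could be called after the write finishes, for example promote_part_file('/your_path/output_directory', '/your_path/output.json'), where the second path is an illustrative name.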
Shreyak answered Oct 15 '22 02:10


I would avoid using write.json since it's causing problems on Windows. Writing the file with Python's own file handling should skip creating the temp directories that are giving you issues.

with open("C:\\Users\\username\\test.json", "w") as output_file:
    # toJSON() returns an RDD of JSON strings, so collect them to the driver first
    output_file.write("\n".join(df.toJSON().collect()))
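
Because toJSON() yields one JSON string per row, the file-writing approach amounts to writing those strings one per line. Below is a minimal sketch with a hypothetical helper, write_json_lines, that accepts any iterable of JSON strings. With Spark that iterable could be df.toJSON().collect(), or df.toJSON().toLocalIterator() to avoid building one large list. Either way every row passes through the driver, so this assumes the data fits there.

```python
import json


def write_json_lines(json_strings, path):
    """Write an iterable of JSON strings to path, one document per line.

    Hypothetical helper; returns the number of lines written.
    """
    count = 0
    with open(path, "w") as output_file:
        for line in json_strings:
            json.loads(line)  # fail early if a row is not valid JSON
            output_file.write(line + "\n")
            count += 1
    return count


# Stand-in for df.toJSON().collect(), so the sketch runs without Spark.
rows = ['{"name": "a", "value": 1}', '{"name": "b", "value": 2}']
```

The result is line-delimited JSON, the same layout Spark's own JSON writer produces inside its part files.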
Brobin answered Oct 15 '22 04:10