PySpark: write a DataFrame to a single JSON file with a specific name

Tags:

apache-spark

pyspark

I have a DataFrame that I want to write as a single JSON file with a specific name. I tried the following:

df2 = df1.select(df1.col1, df1.col2)
# Did not work: creates a folder named 'file_name.json' containing part-XXX files
df2.write.format('json').save('/path/file_name.json')
# Did not work either: same result, a folder 'file_name.json' with part-XXX files
df2.toJSON().saveAsTextFile('/path/file_name.json')

I would appreciate it if someone could provide a solution.

218

asked Apr 07 '17 03:04

Lijju Mathew

2 Answers

To get a single output file, coalesce the DataFrame to one partition before writing:

df2 = df1.select(df1.col1,df1.col2)
df2.coalesce(1).write.format('json').save('/path/file_name.json')

This still creates a folder named file_name.json, but inside that folder you will find a single part-000 file containing all the data.
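Because the write still produces a folder, getting an exact file name takes one more step: moving the single part file out of the folder. The sketch below is one way to do that, assuming the output sits on a local or locally mounted filesystem (not HDFS or S3). The helper name `promote_single_part_file` and the temporary folder `/path/tmp_output` are hypothetical, chosen only for illustration.

```python
import glob
import os
import shutil


def promote_single_part_file(spark_output_dir, target_path):
    """Move the only part-* file in a Spark output folder to target_path.

    Hypothetical helper: assumes a local filesystem and that target_path
    lies outside spark_output_dir, because the folder is deleted afterwards.
    """
    # glob skips hidden files, so the .crc checksum side files are not matched
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*"))
    if len(part_files) != 1:
        raise ValueError(
            "expected exactly one part file, found %d" % len(part_files)
        )
    shutil.move(part_files[0], target_path)
    # Remove what is left in the folder (_SUCCESS marker, checksum files)
    shutil.rmtree(spark_output_dir)
    return target_path


# Intended usage after the Spark write (paths are placeholders):
# df2.coalesce(1).write.format('json').save('/path/tmp_output')
# promote_single_part_file('/path/tmp_output', '/path/file_name.json')
```

Writing to a separate temporary folder first keeps the final name free for the renamed file. On HDFS or object storage, the move would have to go through that storage system's own API instead.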

170

answered Oct 12 '22 07:10

Rakesh Kumar

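You can do it by converting to a pandas DataFrame first. Note that toPandas() collects every row onto the driver, so this only suits data that fits in driver memory: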

df.toPandas().to_json('path/file_name.json', orient='records', force_ascii=False, lines=True)
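To see what `orient='records'` together with `lines=True` produces, here is a small sketch that uses a plain pandas DataFrame in place of `df.toPandas()`. The column names and values are made up for illustration. Each row becomes one JSON object on its own line, which is the same JSON Lines layout that Spark's JSON writer emits.

```python
import json

import pandas as pd

# Stand-in for df.toPandas(); the data here is invented for the example
pdf = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# With no path argument, to_json returns the text instead of writing a file
text = pdf.to_json(orient="records", lines=True, force_ascii=False)

# Parse each line back into a dict to show the one-object-per-line structure
records = [json.loads(line) for line in text.splitlines() if line]
print(records)
```

Passing a path as the first argument, as in the answer above, writes the same text directly to that file name.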

answered Oct 12 '22 07:10

fedosique
