PySpark: spit out single file when writing instead of multiple part files

Tags: python, amazon-s3, apache-spark, apache-spark-sql, pyspark

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file?

If I run:

df.write.format('json').save('myfile.json')

or

df1.write.json('myfile.json')

it creates a folder named myfile, and within it I find several small files named part-***, the HDFS way. Is it possible to have it spit out a single file instead?

859

asked Mar 22 '16 18:03

mar tin

1 Answer

Well, the answer to your exact question is the coalesce function. But as already mentioned, it is not efficient at all: it forces a single worker to fetch all the data and write it sequentially.

df.coalesce(1).write.format('json').save('myfile.json')
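Note that even with coalesce(1), Spark still writes a directory at the given path, now containing a single part-* file alongside marker files such as _SUCCESS. If a standalone file is really needed, one option is to move that part file out afterwards. Here is a minimal sketch, assuming the output lives on a local filesystem (not HDFS or S3); promote_single_part is a hypothetical helper name, not part of PySpark.

```python
import shutil
from pathlib import Path


def promote_single_part(spark_output_dir, target_file):
    """Move the lone part-* file out of a Spark output directory.

    Hypothetical helper: assumes a local filesystem and an output
    directory written with coalesce(1), so it holds exactly one part file.
    """
    out_dir = Path(spark_output_dir)
    # Checksum files start with a dot, so this pattern skips them.
    parts = sorted(out_dir.glob("part-*"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")

    target = Path(target_file)
    shutil.move(str(parts[0]), str(target))
    # Remove the now-empty directory together with _SUCCESS and checksums.
    shutil.rmtree(out_dir)
    return target
```

The target path should sit outside the Spark output directory, for example promote_single_part('myfile.json', 'myfile_single.json'). On HDFS or S3 the same idea applies, but the move has to go through that filesystem's own tooling.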

P.S. Btw, the resulting file is not a single valid JSON document. It is a file with one JSON object per line.
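Because each line is its own JSON object, passing the whole file to json.load will fail as soon as there is more than one record. The file has to be parsed line by line instead. Here is a rough sketch using only the standard library; read_json_lines and write_json_array are hypothetical helper names chosen for illustration.

```python
import json


def read_json_lines(path):
    """Parse a file holding one JSON object per line into a list."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def write_json_array(records, path):
    """Write the records as one JSON array, i.e. a single valid JSON document."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle)
```

This loads every record into memory, so it is only reasonable for output small enough to fit on one machine, which is also the case where coalesce(1) makes sense.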

103

answered Oct 29 '22 00:10

the.malkolm
