
How to avoid generating crc files and SUCCESS files while saving a DataFrame?

I am using the following code to save a spark DataFrame to JSON file

unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json") 

the output result is:

part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
  1. How do I generate a single JSON file and not a file per line?
  2. How can I avoid the *crc files?
  3. How can I avoid the SUCCESS file?
asked Dec 20 '15 by Eran Witkon


1 Answer

If you want a single file, you need to do a coalesce to a single partition before calling write, so:

unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json") 

Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write - especially if you do a write with partitionBy - but as far as I know, there is currently no other way.

I don't know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") 
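That said, one thing you could try for the .crc files (a sketch I haven't verified across Hadoop versions): they are produced by Hadoop's checksummed file system wrapper used for local paths, and the FileSystem API exposes a switch to turn checksum writing off:

```scala
import org.apache.hadoop.fs.FileSystem

// Sketch (unverified): ask the FileSystem handle not to write checksum
// (.crc) side files. Whether this takes effect can depend on the Hadoop
// version and on which FileSystem implementation backs your path.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.setWriteChecksum(false)
```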

Note that you may also want to disable generation of the metadata files with:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") 

Apparently, generating the metadata files takes some time (see this blog post), and they aren't actually that important (according to this). Personally, I always disable them and have had no issues.
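Putting it all together, a sketch of the full write (using the DataFrame and path from the question):

```scala
// Disable the _SUCCESS marker and Parquet summary metadata, then coalesce
// to a single partition so the write produces a single part file.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

unzipJSON
  .coalesce(1) // one partition => one output part file
  .write
  .mode("append")
  .json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
```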

answered Sep 20 '22 by Glennie Helles Sindholt