
How to avoid generating crc files and SUCCESS files while saving a DataFrame?

I am using the following code to save a spark DataFrame to JSON file

unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json") 

the output result is:

part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
  1. How do I generate a single JSON file and not a file per line?
  2. How can I avoid the *crc files?
  3. How can I avoid the SUCCESS file?
asked Dec 20 '15 by Eran Witkon


1 Answer

If you want a single file, you need to do a coalesce to a single partition before calling write, so:

unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json") 

Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write - especially if you do a write with partitionBy - but as far as I know, there is currently no other way.

I don't know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") 
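That said, one thing you could try for the .crc files (a sketch I haven't verified across Hadoop versions): they are produced by Hadoop's checksummed file system wrapper used for local paths, and the FileSystem API exposes a switch to turn checksum writing off:

```scala
import org.apache.hadoop.fs.FileSystem

// Sketch (unverified): ask the FileSystem handle not to write checksum
// (.crc) side files. Whether this takes effect can depend on the Hadoop
// version and on which FileSystem implementation backs your path.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.setWriteChecksum(false)
```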

Note that you may also want to disable generation of the metadata files with:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") 

Apparently, generating the metadata files takes some time (see this blog post), and they aren't actually that important (according to this). Personally, I always disable them and have had no issues.
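Putting it all together, a sketch of the full write (using the DataFrame and path from the question):

```scala
// Disable the _SUCCESS marker and Parquet summary metadata, then coalesce
// to a single partition so the write produces a single part file.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

unzipJSON
  .coalesce(1) // one partition => one output part file
  .write
  .mode("append")
  .json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
```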

answered Sep 20 '22 by Glennie Helles Sindholt