 

What are the files generated by Spark when using "saveAsTextFile"?


When I run a Spark job and save the output as a text file using the saveAsTextFile method, as documented at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD:

[screenshot of the saveAsTextFile API documentation]
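For reference, the call is along these lines (a minimal sketch; my_rdd and the output path are placeholders):

    # PySpark
    # Minimal example: save an RDD's contents as text files under the given path.
    # "sc" is the SparkContext available in the PySpark shell.
    my_rdd = sc.parallelize(["line 1", "line 2", "line 3"])
    my_rdd.saveAsTextFile("hdfs:///user/me/output")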

here are the files that are created:

[screenshot of the output directory: part- files, _SUCCESS, and .crc files]

Is the .crc file a cyclic redundancy check file, used to verify that the content of each generated file is correct?

The _SUCCESS file is always empty; what does it signify?

The files without an extension in the screenshot above contain the actual data from the RDD, but why are many files generated instead of just one?

asked May 27 '14 by blue-sky
1 Answer

Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().

  • part- files: These are your output data files.

    You will have one part- file per partition of the RDD you called saveAsTextFile() on. Each of these files is written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means your output is written much faster than it would be if it were all put into a single file, assuming your storage layer can handle the bandwidth. (If you want a single output file instead, see the sketch after this list.)

    You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:

    # PySpark
    # Get the number of partitions of my_rdd.
    my_rdd._jrdd.splits().size()

    # In Spark 1.1.0 and later, the cleaner equivalent is:
    my_rdd.getNumPartitions()
    
  • _SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.

  • .crc files: I had not seen the .crc files before, but yes, they are cyclic redundancy check files that Hadoop uses to verify the integrity of the corresponding part- files.
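If you want a single output file instead of many, a common workaround (a minimal sketch; my_rdd and the output path are hypothetical) is to coalesce the RDD down to one partition before saving:

    # PySpark
    # Coalesce to a single partition so saveAsTextFile() writes one part- file.
    # This funnels the entire write through a single task, trading away the
    # parallel-write speedup described above.
    my_rdd.coalesce(1).saveAsTextFile("hdfs:///user/me/output-single")

Alternatively, keep the parallel write and merge the part- files afterwards, for example with hadoop fs -getmerge.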

answered Nov 09 '22 by Nick Chammas