
What are the _SUCCESS and part-r-00000 files in Hadoop?

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the _SUCCESS file for? And why is the output file named part-r-00000? Does the name follow some convention, or is it arbitrary?

Ravi Joshi asked May 19 '12 15:05





1 Answer

See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/

On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947)

This would typically be used by job scheduling systems (such as Oozie) as a signal that follow-on processing of this directory's contents can begin, because all the data has been written.
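As a rough illustration of how a downstream step might use the marker, here is a minimal Python sketch. It assumes the job's output directory is reachable as an ordinary filesystem path (for example a local-mode run, or output copied out of HDFS); the function name `job_output_is_complete` is made up for this example and is not part of Hadoop.

```python
from pathlib import Path


def job_output_is_complete(output_dir: str) -> bool:
    """Return True if a _SUCCESS marker file exists in output_dir.

    Assumption: output_dir is a plain filesystem path, not an hdfs:// URI.
    """
    return (Path(output_dir) / "_SUCCESS").is_file()
```

A follow-on script could call this before reading any part files and skip or retry when it returns False. Directly against HDFS, the same check can be done from the shell with `hadoop fs -test -e <output dir>/_SUCCESS`, which exits with status 0 when the file exists.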

Update (in response to comment)

The output files are by default named part-x-yyyyy where:

  • x is either 'm' or 'r', depending on whether the file was written by a mapper (in a map-only job) or by a reducer
  • yyyyy is the zero-based mapper or reducer task number, zero-padded to five digits

So a job which has 32 reducers will have files named part-r-00000 to part-r-00031, one for each reducer task.
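The naming pattern described above can be reproduced in a few lines of Python. This is only a sketch of the default convention as stated in this answer; `part_file_names` is a hypothetical helper, not a Hadoop API.

```python
from typing import List


def part_file_names(num_tasks: int, kind: str = "r") -> List[str]:
    """Build the default output file names for num_tasks tasks.

    kind is 'm' for a map-only job or 'r' for reducer output.
    """
    if kind not in ("m", "r"):
        raise ValueError("kind must be 'm' or 'r'")
    # Task numbers start at zero and are padded to five digits.
    return [f"part-{kind}-{task:05d}" for task in range(num_tasks)]


names = part_file_names(32)
print(names[0], names[-1])  # part-r-00000 part-r-00031
```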

Chris White answered Sep 28 '22 02:09