 

Specifying the output file name in Apache Spark

I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?

In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.

PS: I did try the saveAsTextFile() method. For example:

import re  # needed for the re.split() call below

lines = sc.textFile(filesToProcessStr)
counts = lines.flatMap(lambda x: re.split(r'[\s&]', x.strip()))\
              .saveAsTextFile("/user/itsjeevs/mymr-output")

This still creates the same part-xxxxx files:

[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r-----   2 itsjeevs itsjeevs          0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r--   2 itsjeevs itsjeevs  101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r--   2 itsjeevs itsjeevs   17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001

EDIT

I recently read an article that would make life much easier for Spark users.

asked Aug 13 '14 18:08 by Jeevs


People also ask

How do I rename a file in PySpark?

Use fs.rename(), passing the source and destination paths, to rename a file.
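A minimal sketch of how that can be done from PySpark, reaching Hadoop's FileSystem API through the JVM gateway (the paths below are placeholders and `sc` is assumed to be an existing SparkContext):

# Sketch only: placeholder paths, existing SparkContext `sc` assumed.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Rename the single part file to a friendlier name; rename() returns True on success.
fs.rename(Path("/user/itsjeevs/mymr-output/part-00000"),
          Path("/user/itsjeevs/mymr-output/words.txt"))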

How do I read different file formats in Spark?

Write & Read Text file from HDFSUse textFile() and wholeTextFiles() method of the SparkContext to read files from any file system and to read from HDFS, you need to provide the hdfs path as an argument to the function. If you wanted to read a text file from an HDFS into DataFrame.

What is default file format in Spark?

Spark's default file format is Parquet. Parquet has a number of advantages that improve the performance of querying and filtering the data.
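A brief sketch of reading and writing Parquet with the DataFrame API (the `spark` session and the paths are assumptions):

# Sketch only: assumes a SparkSession `spark` and placeholder paths.
df = spark.read.parquet("/user/itsjeevs/events.parquet")                  # read Parquet into a DataFrame
df.write.mode("overwrite").parquet("/user/itsjeevs/events-copy.parquet")  # write it back out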

What is _success file in Spark?

_SUCCESS file: the presence of an empty _SUCCESS file simply means that the operation completed normally. .crc files: presumably they are checksums of the corresponding part- files.
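If the empty marker file is unwanted, it can be disabled through the underlying Hadoop FileOutputCommitter setting (a sketch, assuming an existing SparkContext `sc`):

# Sketch only: turn off the _SUCCESS marker written by Hadoop's FileOutputCommitter.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")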


1 Answer

Spark also uses Hadoop under the hood, so you can probably get what you want. This is how saveAsTextFile is implemented:

def saveAsTextFile(path: String) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

You could pass in a customized OutputFormat to saveAsHadoopFile. I have no idea how to do that from Python though. Sorry for the incomplete answer.
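For what it's worth, a sketch of what the Python side might look like: PySpark's RDD.saveAsHadoopFile accepts the output format as a class name, but the class itself (com.example.NamedTextOutputFormat below is hypothetical) would still have to be written in Java or Scala, for example by extending MultipleTextOutputFormat and overriding generateFileNameForKeyValue, and shipped to the executors with --jars:

# Sketch only: com.example.NamedTextOutputFormat is a hypothetical custom
# OutputFormat compiled separately and added to the classpath with --jars.
pairs = lines.map(lambda x: (None, x))  # mirror the Scala snippet: (NullWritable, Text)
pairs.saveAsHadoopFile(
    "/user/itsjeevs/mymr-output",
    outputFormatClass="com.example.NamedTextOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text")

In practice, many people sidestep the OutputFormat entirely: coalesce(1) before saving, then rename the single part-00000 file with a FileSystem rename() call like the one sketched earlier.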

answered Sep 24 '22 13:09 by Daniel Darabos