 

Specifying the output file name in Apache Spark

I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?

In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.

PS: I did try the saveAsTextFile() method. For example:

import re  # needed for the re.split() call below

lines = sc.textFile(filesToProcessStr)
counts = lines.flatMap(lambda x: re.split(r'[\s&]', x.strip()))\
              .saveAsTextFile("/user/itsjeevs/mymr-output")

This still creates the same part-xxxxx files:

[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r-----   2 itsjeevs itsjeevs          0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r--   2 itsjeevs itsjeevs  101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r--   2 itsjeevs itsjeevs   17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001

EDIT

I recently read an article that would make life much easier for Spark users.

asked Aug 13 '14 18:08 by Jeevs


People also ask

How do I rename a file in PySpark?

Use fs.rename(), passing the source and destination paths, to rename a file.
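A minimal sketch of how that can be done from PySpark, reaching Hadoop's FileSystem API through the JVM gateway (the paths below are placeholders and `sc` is assumed to be an existing SparkContext):

# Sketch only: placeholder paths, existing SparkContext `sc` assumed.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Rename the single part file to a friendlier name; rename() returns True on success.
fs.rename(Path("/user/itsjeevs/mymr-output/part-00000"),
          Path("/user/itsjeevs/mymr-output/words.txt"))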

How do I read different file formats in Spark?

Write & Read Text file from HDFSUse textFile() and wholeTextFiles() method of the SparkContext to read files from any file system and to read from HDFS, you need to provide the hdfs path as an argument to the function. If you wanted to read a text file from an HDFS into DataFrame.

What is default file format in Spark?

Spark's default file format is Parquet. Parquet has a number of advantages that improve the performance of querying and filtering the data.
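A brief sketch of reading and writing Parquet with the DataFrame API (the `spark` session and the paths are assumptions):

# Sketch only: assumes a SparkSession `spark` and placeholder paths.
df = spark.read.parquet("/user/itsjeevs/events.parquet")                  # read Parquet into a DataFrame
df.write.mode("overwrite").parquet("/user/itsjeevs/events-copy.parquet")  # write it back out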

What is _success file in Spark?

_SUCCESS file: the presence of an empty _SUCCESS file simply means that the operation completed normally. .crc files: presumably they are checksums of the corresponding part- files.
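If the empty marker file is unwanted, it can be disabled through the underlying Hadoop FileOutputCommitter setting (a sketch, assuming an existing SparkContext `sc`):

# Sketch only: turn off the _SUCCESS marker written by Hadoop's FileOutputCommitter.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")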


1 Answer

Spark also uses Hadoop under the hood, so you can probably get what you want. This is how saveAsTextFile is implemented:

def saveAsTextFile(path: String) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

You could pass in a customized OutputFormat to saveAsHadoopFile. I have no idea how to do that from Python though. Sorry for the incomplete answer.
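For what it's worth, a sketch of what the Python side might look like: PySpark's RDD.saveAsHadoopFile accepts the output format as a class name, but the class itself (com.example.NamedTextOutputFormat below is hypothetical) would still have to be written in Java or Scala, for example by extending MultipleTextOutputFormat and overriding generateFileNameForKeyValue, and shipped to the executors with --jars:

# Sketch only: com.example.NamedTextOutputFormat is a hypothetical custom
# OutputFormat compiled separately and added to the classpath with --jars.
pairs = lines.map(lambda x: (None, x))  # mirror the Scala snippet: (NullWritable, Text)
pairs.saveAsHadoopFile(
    "/user/itsjeevs/mymr-output",
    outputFormatClass="com.example.NamedTextOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text")

In practice, many people sidestep the OutputFormat entirely: coalesce(1) before saving, then rename the single part-00000 file with a FileSystem rename() call like the one sketched earlier.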

answered Sep 24 '22 13:09 by Daniel Darabos