I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?
In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.
PS: I did try the saveAsTextFile() method. For example:
import re

lines = sc.textFile(filesToProcessStr)
counts = lines.flatMap(lambda x: re.split(r'[\s&]', x.strip())) \
              .saveAsTextFile("/user/itsjeevs/mymr-output")
This will create the same part-xxxxx files:
[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r----- 2 itsjeevs itsjeevs 0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r-- 2 itsjeevs itsjeevs 101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r-- 2 itsjeevs itsjeevs 17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001
EDIT
I recently read an article which would make life much easier for Spark users.
You can use fs.rename(), passing source and destination paths, to rename the part file after the write finishes.
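A minimal sketch of that approach in PySpark, assuming the output directory and part file name from the listing above, and reaching the Hadoop FileSystem API through sc._jvm / sc._jsc (internal gateway handles):

# Rename one part file after saveAsTextFile() has finished.
# Paths are the ones from the listing above; adjust as needed.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
fs.rename(Path("/user/itsjeevs/mymr-output/part-00000"),
          Path("/user/itsjeevs/mymr-output/mymr-output.txt"))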
The empty _SUCCESS file in the listing simply means that the write completed normally.
Spark is also using Hadoop under the hood, so you can probably get what you want. This is how saveAsTextFile is implemented:
def saveAsTextFile(path: String) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
You could pass in a customized OutputFormat to saveAsHadoopFile. I have no idea how to do that from Python, though. Sorry for the incomplete answer.
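For what it's worth, PySpark does expose saveAsHadoopFile on pair RDDs, taking the OutputFormat as a fully qualified class name. A rough sketch of what the call might look like, assuming rdd holds the output strings and that com.example.RenamingTextOutputFormat is a hypothetical custom OutputFormat written in Java/Scala and added to the classpath:

# Mimic the (NullWritable, Text) pairs that saveAsTextFile builds above.
pairs = rdd.map(lambda line: (None, line))
pairs.saveAsHadoopFile(
    "/user/itsjeevs/mymr-output",
    outputFormatClass="com.example.RenamingTextOutputFormat",  # hypothetical custom class
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text")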