I am writing an RDD to a file using the command below:
rdd.coalesce(1).saveAsTextFile(FilePath)
When FilePath is an HDFS path (hdfs://node:9000/folder/), everything works fine.
When FilePath is a local path (file:///home/user/folder/), everything seems to work: the output folder is created and the _SUCCESS file is present. However, I do not see any part-00000 file containing the output, and there are no other files. There is no error in the Spark console output either.
I also tried calling collect() on the RDD before calling saveAsTextFile(), and giving 777 permissions to the output folder, but nothing works.
Please help.
1. Write a single file using Spark coalesce() & repartition(): when you are ready to write a DataFrame, first use repartition() or coalesce() to merge the data from all partitions into a single partition, and then save it to a file, as sketched below.
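A minimal sketch of that pattern, assuming a hypothetical DataFrame df read from a placeholder input path (the paths and app name are illustrative, not taken from the question):

import org.apache.spark.sql.SparkSession

// Hypothetical session and paths, shown only to illustrate the pattern.
val spark = SparkSession.builder.appName("single-file-write").getOrCreate()
val df = spark.read.option("header", "true").csv("hdfs://node:9000/input/")

// coalesce(1) merges all partitions into one before writing,
// so the output directory ends up with a single part file.
df.coalesce(1)
  .write
  .mode("overwrite")
  .csv("hdfs://node:9000/folder/single-output/")

// repartition(1) achieves the same result, but with a full shuffle, which can be
// preferable when the existing partitions are heavily skewed.
df.repartition(1)
  .write
  .mode("overwrite")
  .csv("hdfs://node:9000/folder/single-output-repartition/")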
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
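For illustration, the same read APIs accept different URI schemes; this is a spark-shell style sketch, and the hosts, bucket, and paths below are placeholders:

import org.apache.hadoop.io.Text
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sources-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Same textFile API, different storage backends; hosts and paths are placeholders.
val fromLocal = sc.textFile("file:///home/user/data/input.txt")
val fromHdfs  = sc.textFile("hdfs://node:9000/data/input.txt")
val fromS3    = sc.textFile("s3a://my-bucket/data/input.txt")

// SequenceFiles (and other Hadoop InputFormats) are read in the same way.
val fromSeq = sc.sequenceFile("hdfs://node:9000/data/seq/", classOf[Text], classOf[Text])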
Saving text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD to files under that path. The path is treated as a directory, and multiple output files will be produced in that directory.
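For example (assuming sc is the spark-shell SparkContext; the path and partition count are illustrative):

// The RDD has 3 partitions, so saveAsTextFile() produces one part file per partition.
val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), numSlices = 3)
rdd.saveAsTextFile("hdfs://node:9000/folder/output/")

// Resulting directory layout (illustrative):
//   /folder/output/_SUCCESS
//   /folder/output/part-00000
//   /folder/output/part-00001
//   /folder/output/part-00002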
Why do we need the CRC and _SUCCESS files? Spark worker nodes write data simultaneously, and these files act as a checksum and job-completion marker for validating the output. Also note that writing to a single file works against the idea of distributed computing, and this approach may fail if your resulting file is too large.
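If the marker and checksum files get in the way, they can usually be suppressed through the underlying Hadoop configuration. This is a hedged sketch using standard Hadoop settings rather than anything required by the question; rdd and the output path are placeholders:

// Suppress the _SUCCESS marker written by the Hadoop output committer.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

// Disable client-side .crc checksum files for the local filesystem.
import org.apache.hadoop.fs.FileSystem
val fs = FileSystem.get(new java.net.URI("file:///"), sc.hadoopConfiguration)
fs.setWriteChecksum(false)

rdd.coalesce(1).saveAsTextFile("file:///home/user/folder/")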
Saving to a local path only has the expected effect when you are using a local master. With a non-local master, each executor writes its part files to its own local filesystem, so the machine you are looking at (the driver) ends up with only the _SUCCESS marker and no part files.
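A common workaround, assuming the result is small enough to fit in driver memory (the output path below is a placeholder), is to collect to the driver and write with plain file I/O:

import java.io.PrintWriter

// Bring the (small) result to the driver and write it on the driver's own disk.
val lines = rdd.collect()
val writer = new PrintWriter("/home/user/folder/output.txt")
try {
  lines.foreach(writer.println)
} finally {
  writer.close()
}

Alternatively, keep writing to HDFS, which already works for you, and pull the merged result down to the local filesystem with hdfs dfs -getmerge.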