I'm running dataFrame.rdd.saveAsTextFile("/home/hadoop/test") in an attempt to write a data frame to disk. This executes with no errors, but the folder is not created. Furthermore, when I run the same command again (in the shell) an Exception is thrown:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/feet already exists
Any idea why this is? Is there a nuance of the submission mode (client vs. cluster) that affects this?
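For what it's worth, a path without a scheme is resolved against the cluster's default filesystem, which the exception above shows is HDFS here, not the local disk. A minimal sketch of spelling the target out explicitly (same dataFrame as above; pick one destination):

// A path with no scheme resolves against fs.defaultFS (HDFS on this cluster,
// hence the hdfs:// URI in the exception), not the local disk.
dataFrame.rdd.saveAsTextFile("hdfs:///home/hadoop/test")   // explicitly HDFS
dataFrame.rdd.saveAsTextFile("file:///home/hadoop/test")   // each node's local disk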
EDIT:
I have permission to create directories in /home/hadoop but I cannot create directories inside any of the dirs/sub-dirs created by rdd.saveAsTextFile("file:/home/hadoop/test"). The structure looks like this:
/home/hadoop/test/_temporary/0
How are _temporary and 0 being created if I do not have permission to create directories inside test from the command line? Is there a way to change the permission of these created directories?
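As a diagnostic sketch (using the path above): checking who owns the committer's staging directory will show whether the executors run as a different user than your shell, which would explain the mismatch.

import java.nio.file.{Files, Paths}

// If the owner is not your shell user (e.g. it is yarn), the directories were
// created by the executor JVMs under another account, which is why you cannot
// mkdir inside them from the command line.
val staging = Paths.get("/home/hadoop/test/_temporary/0")
println(Files.getOwner(staging))
println(Files.getPosixFilePermissions(staging))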
EDIT 2:
In the end I wrote to S3 instead, using rdd.coalesce(1).saveAsTextFile("s3://..."). This is only viable for very small outputs, because coalesce(n) repartitions the RDD so that all further processing happens on only n workers. In my case I chose 1, so that the file would be generated by a single worker. This gave me a folder containing one part-00000 file with all of my data.
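For completeness, a sketch of that workaround with a placeholder bucket name; the coalesce(1) is what forces a single part file:

// All data is funnelled through one task, so keep the output small.
rdd.coalesce(1).saveAsTextFile("s3://my-bucket/output")   // one part-00000 file
// For larger outputs, drop the coalesce and accept several part-xxxxx files:
rdd.saveAsTextFile("s3://my-bucket/output-parts")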
Since SPARK-1100 (https://spark-project.atlassian.net/browse/SPARK-1100), saveAsTextFile should never be able to silently overwrite an already existing folder; it throws instead, as in the exception above.
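That means an existing output directory has to be removed before re-running. A minimal sketch using the Hadoop FileSystem API from the Spark shell (the path is a placeholder):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val output = "hdfs:///home/hadoop/test"
val fs = FileSystem.get(new URI(output), sc.hadoopConfiguration)
fs.delete(new Path(output), true)   // recursive; returns false if it did not exist
rdd.saveAsTextFile(output)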
If you receive a java.io.IOException: Mkdirs failed to create file:..., it probably means you have a permission problem when trying to write to the output path.
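A quick way to test that hypothesis from the driver side, with a throwaway path:

import java.io.File

// If plain mkdirs also fails here, it is a filesystem permission problem,
// not anything Spark-specific.
val ok = new File("/home/hadoop/perms-check").mkdirs()
println(s"driver user can create directories: $ok")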
If you give more context, the answers can be more helpful. For example: are you running in a local shell or a cluster shell? Which type of cluster?
EDIT: I think you are hitting that error because all the executors are trying to write to the same local path, which isn't available on every executor node.
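If that is the case, the usual fix is to write to a filesystem every node can see, or, for small results only, to collect to the driver and write a single local file there. A sketch with placeholder paths:

// Shared filesystem: every executor can reach HDFS, so the commit succeeds.
rdd.saveAsTextFile("hdfs:///home/hadoop/test")

// Small results only: pull everything to the driver and write locally there.
import java.io.PrintWriter
val out = new PrintWriter("/home/hadoop/test.txt")
rdd.collect().foreach(row => out.println(row))
out.close()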