In my hdfs-site.xml
I configured a replication factor of 1.
However, when writing my result to hdfs:
someMap.saveAsTextFile("hdfs://HOST:PORT/out")
the results get automatically replicated by a factor of 3, overriding my own replication factor. To save some space, I would prefer to have a replication factor of 1 for my output as well.
How can I make Spark tell HDFS to use a replication factor of 1?
I think Spark is loading a default Hadoop configuration that has replication set to 3. To override it, you need to set a system property or JVM option, similar to other Spark configuration options.
You probably want something like:
System.setProperty("spark.hadoop.dfs.replication", "1")
or in your JVM startup flags:
-Dspark.hadoop.dfs.replication=1
Hopefully something like this should work...
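If you are constructing the SparkContext yourself, you can also set the property on the SparkConf, or set dfs.replication directly on the job's Hadoop configuration. A minimal sketch (the object name and sample data are hypothetical; any property prefixed with spark.hadoop. is copied by Spark into the Hadoop Configuration it uses when writing to HDFS):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver showing two ways to pass dfs.replication to HDFS.
object ReplicationOneExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("replication-example") // hypothetical app name
      // "spark.hadoop.*" properties are forwarded into the Hadoop Configuration
      .set("spark.hadoop.dfs.replication", "1")
    val sc = new SparkContext(conf)

    // Alternatively, set it on the Hadoop configuration directly:
    sc.hadoopConfiguration.set("dfs.replication", "1")

    val someMap = sc.parallelize(Seq("a", "b", "c")) // hypothetical sample data
    someMap.saveAsTextFile("hdfs://HOST:PORT/out")   // HOST:PORT as in the question
    sc.stop()
  }
}
```

The same property can be passed on the command line when submitting the job, e.g. spark-submit --conf spark.hadoop.dfs.replication=1, which avoids hard-coding it in the application.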