In my hdfs-site.xml
I configured a replication factor of 1.
However, when writing my result to hdfs:
someMap.saveAsTextFile("hdfs://HOST:PORT/out")
the results get automatically replicated by a factor of 3, overriding my own replication factor. To save some space, I would prefer to have a replication factor of 1 for my output as well.
How can I make Spark tell HDFS to use a replication factor of 1?
I think Spark is loading a default Hadoop configuration that has replication set to 3. To override it, you need to set a system property or JVM option, similar to other Spark configuration options.
You probably want something like:
System.setProperty("spark.hadoop.dfs.replication", "1")
or in your JVM startup flags:
-Dspark.hadoop.dfs.replication=1
Hopefully something like this should work...
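If you are constructing the SparkContext yourself, you can also set the property on the SparkConf, or set dfs.replication directly on the job's Hadoop configuration. A minimal sketch (the object name and sample data are hypothetical; any property prefixed with spark.hadoop. is copied by Spark into the Hadoop Configuration it uses when writing to HDFS):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver showing two ways to pass dfs.replication to HDFS.
object ReplicationOneExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("replication-example") // hypothetical app name
      // "spark.hadoop.*" properties are forwarded into the Hadoop Configuration
      .set("spark.hadoop.dfs.replication", "1")
    val sc = new SparkContext(conf)

    // Alternatively, set it on the Hadoop configuration directly:
    sc.hadoopConfiguration.set("dfs.replication", "1")

    val someMap = sc.parallelize(Seq("a", "b", "c")) // hypothetical sample data
    someMap.saveAsTextFile("hdfs://HOST:PORT/out")   // HOST:PORT as in the question
    sc.stop()
  }
}
```

The same property can be passed on the command line when submitting the job, e.g. spark-submit --conf spark.hadoop.dfs.replication=1, which avoids hard-coding it in the application.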