
Spark Standalone Mode: Change replication factor of HDFS output

In my hdfs-site.xml I configured a replication factor of 1.
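For reference, a minimal hdfs-site.xml entry for this setting looks roughly like the following sketch (other properties omitted):

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>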

However, when writing my result to hdfs:

someMap.saveAsTextFile("hdfs://HOST:PORT/out")

the results are automatically replicated with a factor of 3, overriding my own replication factor. To save space, I would prefer a replication factor of 1 for my output as well.

How can Spark tell HDFS to use a replication factor of 1?

asked Mar 23 '23 by ptikobj

1 Answer

I think Spark is loading a default Hadoop configuration that has replication set to 3. To override it, you need to set either an environment variable or a system property, similar to the other Spark configuration options (see the Spark configuration documentation).

You probably want something like:

System.setProperty("spark.hadoop.dfs.replication", "1")  // set this before the SparkContext is created

or in your JVM startup options:

 -Dspark.hadoop.dfs.replication=1
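If you construct the SparkContext yourself, an equivalent sketch (assuming the spark.hadoop.* prefix mechanism described above; the names conf, sc and someMap below are placeholders) is to put the property on the SparkConf before writing:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the spark.hadoop.* prefix forwards the property to the Hadoop Configuration
val conf = new SparkConf()
  .setAppName("ReplicationExample")
  .set("spark.hadoop.dfs.replication", "1")

val sc = new SparkContext(conf)

// ... build someMap, then write it out ...
// someMap.saveAsTextFile("hdfs://HOST:PORT/out")  // output should now be written with replication 1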

Hopefully one of these will work...

answered Apr 06 '23 by Noah