 

How to set Hadoop configuration values from PySpark

The Scala version of SparkContext has the property

sc.hadoopConfiguration

I have successfully used that to set Hadoop properties (in Scala)

e.g.

sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")

However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values in the Hadoop Configuration used by the PySpark context?

Asked by WestCoastProjects on Mar 04 '15


People also ask

What is _JSC in PySpark?

PySpark has a JVM running behind the scenes to actually run the Spark code. sc._jvm is the gateway into said JVM, and sc._jsc is the Java Spark Context, a proxy into the SparkContext in that JVM.
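
A minimal sketch of those internals (both are private attributes, so details may change between Spark versions; fs.defaultFS is just a standard Hadoop property used for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# _jvm is the Py4J gateway into the JVM; _jsc wraps the JavaSparkContext.
# hadoopConfiguration() returns the underlying org.apache.hadoop.conf.Configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))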

What is SparkContext?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
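
A minimal illustration of that lifecycle (the app name is just a placeholder):

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate()
sc.stop()  # the active context must be stopped first

# only now is it safe to create a new one
sc = SparkContext(conf=SparkConf().setAppName("fresh-context"))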

What is PySpark?

PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.


3 Answers

sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal') 

should work
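
As a quick sanity check, you can read the value back through the same accessor (sc._jsc is a private attribute, so this sketch may vary across PySpark versions):

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('my.mapreduce.setting', 'someVal')

# confirm the property landed in the shared Hadoop Configuration
assert hadoop_conf.get('my.mapreduce.setting') == 'someVal'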

Answered by Dmytro Popovych on Sep 26 '22


You can set any Hadoop property using the --conf parameter when submitting the job: any property prefixed with spark.hadoop. is copied into the Hadoop Configuration with the prefix stripped.

--conf "spark.hadoop.fs.mapr.trace=debug" 

Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
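
The same spark.hadoop.* prefix can also be set programmatically on a SparkConf before the context starts; here is a minimal sketch reusing the property name from the question:

from pyspark import SparkConf, SparkContext

# Every "spark.hadoop.*" entry is copied into the Hadoop Configuration
# with the prefix stripped when the SparkContext is created.
conf = SparkConf().set("spark.hadoop.my.mapreduce.setting", "someVal")
sc = SparkContext(conf=conf)

print(sc._jsc.hadoopConfiguration().get("my.mapreduce.setting"))  # someVal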

Answered by Harikrishnan Ck on Sep 25 '22


I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:

fileLines = sc.newAPIHadoopFile(
    'dev/*',  # input path (glob)
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',  # input format
    'org.apache.hadoop.io.LongWritable',  # key class
    'org.apache.hadoop.io.Text',  # value class
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()
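
Note that, unlike mutating sc._jsc.hadoopConfiguration(), a conf dict passed this way should only apply to that one read rather than to the context's shared Hadoop Configuration.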
Answered by WestCoastProjects on Sep 25 '22