 

How to set Hadoop configuration values from PySpark

The Scala version of SparkContext has the property

sc.hadoopConfiguration

I have successfully used that to set Hadoop properties (in Scala)

e.g.

sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")

However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values in the Hadoop Configuration used by the PySpark context?

Asked by WestCoastProjects on Mar 04 '15


People also ask

What is _JSC in PySpark?

PySpark has a JVM running behind the scenes to actually run the Spark code. sc._jvm is the gateway into said JVM, and sc._jsc is the Java Spark Context, a proxy into the SparkContext in that JVM.
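
A minimal sketch of those internals (both are private attributes, so details may change between Spark versions; fs.defaultFS is just a standard Hadoop property used for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# _jvm is the Py4J gateway into the JVM; _jsc wraps the JavaSparkContext.
# hadoopConfiguration() returns the underlying org.apache.hadoop.conf.Configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))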

What is SparkContext?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
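
A minimal illustration of that lifecycle (the app name is just a placeholder):

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate()
sc.stop()  # the active context must be stopped first

# only now is it safe to create a new one
sc = SparkContext(conf=SparkConf().setAppName("fresh-context"))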

What is PySpark?

PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.


3 Answers

sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal') 

should work
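
As a quick sanity check, you can read the value back through the same accessor (sc._jsc is a private attribute, so this sketch may vary across PySpark versions):

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('my.mapreduce.setting', 'someVal')

# confirm the property landed in the shared Hadoop Configuration
assert hadoop_conf.get('my.mapreduce.setting') == 'someVal'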

Answered by Dmytro Popovych on Sep 26 '22


You can set any Hadoop property using the --conf parameter when submitting the job: any property prefixed with spark.hadoop. is copied into the Hadoop Configuration with the prefix stripped.

--conf "spark.hadoop.fs.mapr.trace=debug" 

Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
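
The same spark.hadoop.* prefix can also be set programmatically on a SparkConf before the context starts; here is a minimal sketch reusing the property name from the question:

from pyspark import SparkConf, SparkContext

# Every "spark.hadoop.*" entry is copied into the Hadoop Configuration
# with the prefix stripped when the SparkContext is created.
conf = SparkConf().set("spark.hadoop.my.mapreduce.setting", "someVal")
sc = SparkContext(conf=conf)

print(sc._jsc.hadoopConfiguration().get("my.mapreduce.setting"))  # someVal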

Answered by Harikrishnan Ck on Sep 25 '22


I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:

fileLines = sc.newAPIHadoopFile(
    'dev/*',  # input path (glob)
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',  # input format
    'org.apache.hadoop.io.LongWritable',  # key class
    'org.apache.hadoop.io.Text',  # value class
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()
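
Note that, unlike mutating sc._jsc.hadoopConfiguration(), a conf dict passed this way should only apply to that one read rather than to the context's shared Hadoop Configuration.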
Answered by WestCoastProjects on Sep 25 '22