I'm trying to execute (with PyCharm) some examples in Python as self-contained Spark applications.
I installed pyspark using:
pip install pyspark
According to the example's web page, it should be enough to execute it with:
python nameofthefile.py
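For reference, the script is essentially the standard self-contained example (a minimal sketch of my Databricks.py; only the SparkSession line is taken verbatim from my file):
from pyspark.sql import SparkSession

# Minimal self-contained application
spark = SparkSession.builder.appName("Databricks").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()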
But I have this error:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
at java.base/java.lang.String.substring(String.java:1874)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
... 23 more
Traceback (most recent call last):
File "C:/Users/.../PycharmProjects/PoC/Databricks.py", line 4, in <module>
spark = SparkSession.builder.appName("Databricks").getOrCreate()
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 349, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 298, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
What could be wrong?
EXTRA
According to the post where you can find the solution, in my case I had to switch from JDK 11 to JDK 1.8.
Now I can run the example code, but I get an error (which does not prevent it from running):
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2019-01-24 08:46:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Here is the solution for the "Could not locate executable null\bin\winutils.exe" error.
In summary, to solve this second problem you just need to define the HADOOP_HOME and PATH environment variables in the Control Panel so that any Windows program can use them.
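Alternatively, a sketch of setting them from the script itself, before the SparkSession is created (C:/hadoop here is just an example path for the folder that contains bin\winutils.exe):
import os

# Point Hadoop to the folder that contains bin/winutils.exe (example path)
os.environ["HADOOP_HOME"] = "C:/hadoop"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + "/bin;" + os.environ["PATH"]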
Short answer:
I had a similar issue, which I resolved by changing my JAVA_HOME environment variable configuration. You can manually add a new user environment variable JAVA_HOME pointing to the path of your Java Development Kit (something like "C:/Progra~1/Java/jdk1.8.0_121", or "C:/Progra~2/Java/jdk1.8.0_121" if it is installed under "Program Files (x86)" on your Windows).
You may also try something like this at the beginning of your Python code:
import os
os.environ["JAVA_HOME"] = "C:/Progra~1/Java/jdk1.8.0_121"
(or, again, "C:/Progra~2/Java/jdk1.8.0_121" if your JDK is installed under "Program Files (x86)").
Longer answer: Independently of pyspark, did you install the Spark binaries (with Hadoop included)? You also need a compatible Java Development Kit (JDK) installed (Java 8+ as of Spark 2.3.0), and you need to configure user environment variables such as:
JAVA_HOME with the path to the Java Development Kit
SPARK_HOME with the path to the Spark binaries
HADOOP_HOME with the path to the Hadoop binaries
You may do something like this from Python:
import os

# Example paths; adjust them to your own JDK, Spark and Hadoop locations
os.environ["JAVA_HOME"] = "C:/Progra~2/Java/jdk1.8.0_121"
os.environ["SPARK_HOME"] = "/path/to/spark-2.3.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = "/path/to/hadoop"  # on Windows, the folder that contains bin/winutils.exe
Then I recommend using findspark (which you can install with pip install findspark): https://github.com/minrk/findspark
You can then use it like this:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
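After that, a quick sanity check (just a sketch) to confirm that the session actually starts:
# Verify that the JVM gateway came up and Spark can run a trivial job
print(spark.version)
spark.range(5).show()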
Especially if you are on Windows, JAVA_HOME should be something like:
C:\Progra~1\Java\jdk1.8.0_121
And, "If JDK is installed under \Program Files (x86), then replace the Progra~1 part by Progra~2 instead."
Details can be found here for installation on Windows (it is for Jupyter, but the installation of Spark and pyspark is the same): https://changhsinlee.com/install-pyspark-windows-jupyter/
I hope it helps. Good luck and have a nice day/evening!