I have a conda installation of python 3.7
$python3 --version
Python 3.7.6
pyspark was installed via pip3 install (conda does not have a native package for it).
$conda list | grep pyspark
pyspark 2.4.5 pypi_0 pypi
Here is what pip3 tells me:
$pip3 install pyspark
Requirement already satisfied: pyspark in ./miniconda3/lib/python3.7/site-packages (2.4.5)
Requirement already satisfied: py4j==0.10.7 in ./miniconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
JDK 11 is installed:
$java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
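For completeness, pyspark starts the JVM itself, so it may also matter which Java it picks up. A quick way to see the candidates (assuming the launcher honours JAVA_HOME and otherwise falls back to whatever java is on the PATH):

import os, shutil

# JAVA_HOME, if set, and the java executable found on the PATH.
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
print("java on PATH =", shutil.which("java"))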
When attempting to use pyspark, things do not go so well. Here is a mini test program:
from pyspark.sql import SparkSession
import os, sys

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame({'a':[1,2,3],'b':[4,5,6]})
df.show()
That results in:
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Here are the full details:
$python3 sparktest.py
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Traceback (most recent call last):
  File "sparktest.py", line 9, in <module>
    sp = setupSpark()
  File "sparktest.py", line 6, in setupSpark
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
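One thing that seems worth checking, given that SparkSubmit cannot find org/apache/log4j/spi/Filter, is whether a SPARK_HOME pointing at some other Spark install is shadowing the jars bundled with the pip package. A rough check (just inspecting the obvious locations; treat this as a guess at the cause, not a diagnosis):

import os
import pyspark

# Where the launcher thinks Spark lives (empty usually means the pip-bundled copy is used).
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))

# The pip-installed pyspark ships its own jars/ directory next to the package.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print("bundled jars dir:", jars_dir)
if os.path.isdir(jars_dir):
    print("log4j jars:", [j for j in os.listdir(jars_dir) if "log4j" in j])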
Any pointers or info on a working pyspark environment in conda would be appreciated.
Update: It may be that pyspark is only available from conda-forge; I only started using that channel for conda installs recently. But it does not change the result:
conda install -c conda-forge conda-forge::pyspark
Collecting package metadata (current_repodata.json): done
Solving environment: done
# All requested packages already installed.
Re-running the code above still gives us:
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
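Since a pip copy and (possibly) a conda-forge copy may now both be present, it could also be worth confirming which pyspark actually gets imported:

import pyspark

# Version and on-disk location of the copy Python resolves,
# which shows whether the pip or the conda-forge package is in use.
print(pyspark.__version__)
print(pyspark.__file__)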
The following steps are for running your mini test program in a Conda environment:
Step 1: Create and activate a new Conda environment
conda create -n test python=3.7 -y
conda activate test
Step 2: Install the latest pyspark and pandas
pip install -U pyspark pandas # Note: I also tested pyspark version 2.4.7
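(Optional) a quick sanity check of what ended up in the fresh environment before running the test; at the time of writing pip pulls a 3.x pyspark, but any recent version should behave the same:

import pyspark, pandas

# Versions resolved inside the new conda env.
print("pyspark:", pyspark.__version__)
print("pandas:", pandas.__version__)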
Step 3: Run the mini test. (I have made a small change: the DataFrame is created from a pandas DataFrame instead of a dict.)
from pyspark.sql import SparkSession
import os, sys
import pandas as pd

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))
df.show()
Step 4: Enjoy the output
+---+---+
| a| b|
+---+---+
| 1| 4|
| 2| 5|
| 3| 6|
+---+---+
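As a side note, if you would rather not pull in pandas just for the test, the same DataFrame can also be built from a list of tuples plus column names (this should produce the same output, though I only ran the pandas variant above):

# Equivalent test data without pandas: rows as tuples, column names as the schema.
df = sp.createDataFrame([(1, 4), (2, 5), (3, 6)], ["a", "b"])
df.show()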
The Java version I used when running pyspark:
$ java -version
java version "15.0.2" 2021-01-19
Java(TM) SE Runtime Environment (build 15.0.2+7-27)
Java HotSpot(TM) 64-Bit Server VM (build 15.0.2+7-27, mixed mode, sharing)