I'm trying to learn Spark by following a hello-world-level example like the one below, using PySpark. I get a "Method isBarrier([]) does not exist" error; the full error is included below the code.
from pyspark import SparkContext

if __name__ == '__main__':
    # Run locally with 6 worker threads
    sc = SparkContext('local[6]', 'pySpark_pyCharm')
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
    rdd.collect()
    rdd.count()
However, when I start a pyspark session directly from the command line and type in the same code, it works fine:
My setup:
The problem is an incompatibility between the versions of the Spark JVM libraries and PySpark. In general, the PySpark version has to exactly match the version of your Spark installation (while in theory matching major and minor versions should be enough, some incompatibilities in maintenance releases have been introduced in the past).
In other words, Spark 2.3.3 is not compatible with PySpark 2.4.0, and you have to either upgrade Spark to 2.4.0 or downgrade PySpark to 2.3.3.
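One quick way to confirm the mismatch is to compare the version of the Python package against the version the JVM reports (a minimal sketch; in a healthy setup both values are identical):

import pyspark
from pyspark import SparkContext

# Version of the PySpark package installed from PyPI (e.g. '2.4.0')
print(pyspark.__version__)

# Version of the underlying Spark JVM installation (e.g. '2.3.3')
sc = SparkContext('local[1]', 'version_check')
print(sc.version)
sc.stop()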
Overall, PySpark is not designed to be used as a standalone library. While the PyPI package is a handy development tool (it is often easier to just install a package than to manually extend the PYTHONPATH), for actual deployments it is better to stick with the PySpark package bundled with the actual Spark deployment.
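If you want to use the bundled PySpark instead of the PyPI package, one option is to put it on sys.path yourself (a rough sketch; SPARK_HOME and the Py4J archive name are assumptions that depend on your installation):

import glob
import os
import sys

# Assumed to point at your Spark installation, e.g. /opt/spark
spark_home = os.environ['SPARK_HOME']

# Put the bundled PySpark and its Py4J dependency ahead of any PyPI copy
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))[0])

This is essentially what the findspark package in the next answer automates.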
Try starting your Python script/session with

import findspark

# Prepend the Spark installation's python/ directory to sys.path
findspark.init()

This updates sys.path based on the Spark installation directory. It worked for me.
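If SPARK_HOME is not set in your environment, findspark.init also accepts an explicit path (the path below is hypothetical; substitute your own installation directory):

import findspark
findspark.init('/opt/spark')  # hypothetical Spark home; replace with yours

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext('local[6]', 'pySpark_pyCharm')
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
    print(rdd.collect())
    print(rdd.count())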