I'm setting up GeoSpark Python and after installing all the pre-requisites, I'm running the very basic code examples to test it.
from pyspark.sql import SparkSession
from geo_pyspark.register import GeoSparkRegistrator
spark = SparkSession.builder.\
        getOrCreate()
GeoSparkRegistrator.registerAll(spark)
df = spark.sql("""SELECT st_GeomFromWKT('POINT(6.0 52.0)') as geom""")
df.show()
I tried running it with python3 basic.py and spark-submit basic.py, both give me this error:
Traceback (most recent call last):
  File "/home/jessica/Downloads/geo_pyspark/basic.py", line 8, in <module>
    GeoSparkRegistrator.registerAll(spark)
  File "/home/jessica/Downloads/geo_pyspark/geo_pyspark/register/geo_registrator.py", line 22, in registerAll
    cls.register(spark)
  File "/home/jessica/Downloads/geo_pyspark/geo_pyspark/register/geo_registrator.py", line 27, in register
    spark._jvm. \
TypeError: 'JavaPackage' object is not callable
I'm using Java 8, Python 3, Apache Spark 2.4, my JAVA_HOME is set correctly, I'm running Linux Mint 19. My SPARK_HOME is also set:
$ printenv SPARK_HOME
/home/jessica/spark/
How can I fix this?
The Jars for geoSpark are not correctly registered with your Spark Session. There's a few ways around this ranging from a tad inconvenient to pretty seamless. For example, if when you call spark-submit you specify:
--jars jar1.jar,jar2.jar,jar3.jar 
then the problem will go away, you can also provide a similar command to pyspark if that's your poison. 
If, like me, you don't really want to be doing this every time you boot (and setting this as a .conf() in Jupyter will get tiresome) then instead you can go into $SPARK_HOME/conf/spark-defaults.conf and set:
spark-jars jar1.jar,jar2.jar,jar3.jar
Which will then be loaded when you create a spark instance. If you've not used the conf file before it'll be there as spark-defaults.conf.template.
Of course, when I say jar1.jar.... What I really mean is something along the lines of:
/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar 
but that's up to you to get the right ones from the geo_pyspark package.
If you are using an EMR: You need to set your cluster config json to
[
  {
    "classification":"spark-defaults", 
    "properties":{
      "spark.jars": "/jars/geo_wrapper_2.11-0.3.0.jar,/jars/geospark-1.2.0.jar,/jars/geospark-sql_2.3-1.2.0.jar,/jars/geospark-viz_2.3-1.2.0.jar"
      }, 
    "configurations":[]
  }
]
and also get your jars to upload as part of your bootstrap. You can do this from Maven but I just threw them on an S3 bucket:
#!/bin/bash
sudo mkdir /jars
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geo_wrapper_2.11-0.3.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-1.2.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-sql_2.3-1.2.0.jar /jars/
sudo aws s3 cp s3://geospark-test-ds/bootstrap/geospark-viz_2.3-1.2.0.jar /jars/
If you are using an EMR Notebook You need a magic cell at the top of your notebook:
%%configure -f
{
"jars": [
        "s3://geospark-test-ds/bootstrap/geo_wrapper_2.11-0.3.0.jar",
        "s3://geospark-test-ds/bootstrap/geospark-1.2.0.jar",
        "s3://geospark-test-ds/bootstrap/geospark-sql_2.3-1.2.0.jar",
        "s3://geospark-test-ds/bootstrap/geospark-viz_2.3-1.2.0.jar"
    ]
}
                        I was seeing a similar kind of issue with SparkMeasure jars on Windows 10 machine
self.stagemetrics =
self.sc._jvm.ch.cern.sparkmeasure.StageMetrics(self.sparksession._jsparkSession)
TypeError: 'JavaPackage' object is not callable
So what I did was
Went to 'SPARK_HOME' via Pyspark shell, and installed the required jar
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.16
Grabbed that jar ( ch.cern.sparkmeasure_spark-measure_2.12-0.16.jar ) and copied into the the Jars folder of 'SPARK_HOME'
Reran the script and now it worked without that above error.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With