spark-nlp: DocumentAssembler initialization failing with 'java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class'

I am trying out the ContextAwareSpellChecker described in https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc

The first component in the pipeline is a DocumentAssembler.

from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp


spark = sparknlp.start()
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

When run, the above code fails as shown below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\sparknlp\base.py", line 148, in __init__
    super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\sparknlp\internal.py", line 72, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\ml\wrapper.py", line 69, in _new_java_obj
    return java_obj(*java_args)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1569, in __call__
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\sql\utils.py", line 131, in deco
    return f(*a, **kw)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.DocumentAssembler.
: java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
        at com.johnsnowlabs.nlp.DocumentAssembler.<init>(DocumentAssembler.scala:16)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

Edit: Apache Spark version is 2.4.6

Asked Nov 07 '22 by Abhishek P


2 Answers

MLReadable or MLWritable errors in Apache Spark are almost always about a mismatch of Spark major versions. More precisely, something was compiled/shipped against one Scala version (e.g. 2.11.x) and is now being used with another (e.g. 2.12.x), such as when you use Spark NLP artifacts built for PySpark 3.x inside PySpark 2.4.x, or vice versa.
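
A quick way to check what you are actually running before picking a package (a minimal sketch; it assumes an active SparkSession named spark, e.g. the one returned by sparknlp.start()):

import pyspark
import sparknlp

print(pyspark.__version__)   # PySpark version installed in this Python environment
print(spark.version)         # Spark version of the running session
print(sparknlp.version())    # Spark NLP version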

Spark NLP supports all major releases of Apache Spark, depending on which version of spark-nlp you are using. The compatibility matrix is as follows:

Spark NLP   Apache Spark 2.3.x   Apache Spark 2.4.x   Apache Spark 3.0.x   Apache Spark 3.1.x
3.3.x       YES                  YES                  YES                  YES
3.2.x       YES                  YES                  YES                  YES
3.1.x       YES                  YES                  YES                  YES
3.0.x       YES                  YES                  YES                  YES
2.7.x       YES                  YES                  NO                   NO
2.6.x       YES                  YES                  NO                   NO
2.5.x       YES                  YES                  NO                   NO
2.4.x       Partially            YES                  NO                   NO
1.8.x       Partially            YES                  NO                   NO
1.7.x       YES                  NO                   NO                   NO
1.6.x       YES                  NO                   NO                   NO
1.5.x       YES                  NO                   NO                   NO

https://github.com/JohnSnowLabs/spark-nlp#apache-spark-support.

As you can see, since Spark NLP 3.0.x all major Apache Spark releases are supported: 2.3.x (for those stuck on older Hortonworks distributions), 2.4.x (for those stuck on Cloudera 5.x/6.x), 3.0.x, and 3.1.x (on Databricks, EMR, and other platforms that offer newer Spark/PySpark releases).

Therefore, you don't need to downgrade or upgrade your Apache Spark/PySpark installation (though it's fine if you can) in order to use any Spark NLP release from 3.0.x onward. The key is to include the correct Maven package in your SparkSession.

Manual SparkSession creation

For instance, if you want to use the Spark NLP 3.3.4 release:

  • In Spark/PySpark 3.0.x/3.1.x: com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
  • In Spark/PySpark 2.4.x: com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
  • In Spark/PySpark 2.3.x: com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4

The artifact name changes: the default is spark-nlp_2.12 for PySpark 3.0.x and 3.1.x, but for PySpark 2.4.x, for instance, it becomes spark-nlp-spark24_2.11, since that build is based on Scala 2.11.

You can find the correct package for your version in that version's release notes. For Spark NLP 3.3.4, for example: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.3.4.
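
For example, on the asker's Spark/PySpark 2.4.x, a manually built session could look like the following (a minimal sketch; the memory and serializer settings are just common defaults, not requirements):

from pyspark.sql import SparkSession

# Pull the Scala 2.11 build of Spark NLP 3.3.4 for Spark 2.4.x from Maven
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "8G") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4") \
    .getOrCreate()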

Auto SparkSession creation

If you wish to use the sparknlp.start() function, you can pass the following flags to automatically start a SparkSession with the correct Maven package:

import sparknlp

# for PySpark 3.0.x and 3.1.x
spark = sparknlp.start()

# for PySpark 2.4.x
spark = sparknlp.start(spark24=True)

# or for PySpark 2.3.x
spark = sparknlp.start(spark23=True)
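
Putting it together for the asker's setup (Spark 2.4.6), a minimal sketch of the original snippet with the matching package:

import sparknlp
from sparknlp.base import DocumentAssembler

# spark24=True makes start() pull the Scala 2.11 / Spark 2.4.x build of Spark NLP
spark = sparknlp.start(spark24=True)

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")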

I would also like to point out this discussion, which covers the MLReadable or MLWritable errors:

Why do I see serialVersionUID or MLReadable or MLWritable errors

Full Disclosure: I am one of the Spark NLP maintainers.

Answered Nov 14 '22 by Maziyar


I've had this issue when upgrading from Spark 2.4.5 to Spark 3+ (on Databricks with Scala, though). Try downgrading your Spark version.

Answered Nov 14 '22 by jamieoreillyg