I am trying out the context-aware spell checker described in https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc
The first component in the pipeline is a DocumentAssembler:
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
spark = sparknlp.start()
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
The above code fails when run, as shown below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
return func(self, **kwargs)
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\sparknlp\base.py", line 148, in __init__
super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
return func(self, **kwargs)
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\sparknlp\internal.py", line 72, in __init__
self._java_obj = self._new_java_obj(classname, self.uid)
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\ml\wrapper.py", line 69, in _new_java_obj
return java_obj(*java_args)
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1569, in __call__
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\sql\utils.py", line 131, in deco
return f(*a, **kw)
File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.DocumentAssembler.
: java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
at com.johnsnowlabs.nlp.DocumentAssembler.<init>(DocumentAssembler.scala:16)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Edit: Apache Spark version is 2.4.6
MLReadable or MLWritable errors in Apache Spark are almost always about a mismatch of Spark major versions (more accurately, something was compiled/shipped against one Scala version, such as 2.11.x, and is now being used with another, such as 2.12.x; for example, when you use Spark NLP artifacts meant for PySpark 3.x in PySpark 2.4.x, or vice versa).
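To confirm which Spark and Scala versions your PySpark installation is actually running, here is a quick diagnostic sketch (the py4j gateway call to scala.util.Properties is an assumption that works on standard Spark builds):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Apache Spark version, e.g. 2.4.6 -> needs the spark24 / Scala 2.11 artifact
print(spark.version)
# Scala version on the JVM side, reached through the py4j gateway
print(spark.sparkContext._jvm.scala.util.Properties.versionString())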
Spark NLP supports all major releases of Apache Spark (depending on which version of spark-nlp you are using). The compatibility matrix is as follows:
| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x |
|---|---|---|---|---|
| 3.3.x | YES | YES | YES | YES |
| 3.2.x | YES | YES | YES | YES |
| 3.1.x | YES | YES | YES | YES |
| 3.0.x | YES | YES | YES | YES |
| 2.7.x | YES | YES | NO | NO |
| 2.6.x | YES | YES | NO | NO |
| 2.5.x | YES | YES | NO | NO |
| 2.4.x | Partially | YES | NO | NO |
| 1.8.x | Partially | YES | NO | NO |
| 1.7.x | YES | NO | NO | NO |
| 1.6.x | YES | NO | NO | NO |
| 1.5.x | YES | NO | NO | NO |
https://github.com/JohnSnowLabs/spark-nlp#apache-spark-support.
As you can see, since Spark NLP 3.0.x all the major Apache Spark releases are supported: 2.3.x (for those stuck on old Hortonworks), 2.4.x (for those stuck on Cloudera 5.x/6.x), and 3.0.x/3.1.x (on Databricks, EMR, and anywhere else that offers newer Spark/PySpark releases).
Therefore, you don't need to downgrade or upgrade your Apache Spark/PySpark (though if you can, that's great!) in order to use any Spark NLP release from 3.0.x onward. The key is to include the correct Maven package in your SparkSession.
For instance, if you want to use the Spark NLP 3.3.4 release, the available packages are:
com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
The actual artifact name changes: the default, spark-nlp_2.12, is for PySpark 3.0.x and 3.1.x, but for PySpark 2.4.x it becomes spark-nlp-spark24_2.11, since that line is built against Scala 2.11.
You can find the correct package for your version in that version's release notes. For Spark NLP 3.3.4, for example: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.3.4.
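If you prefer to build the SparkSession yourself rather than relying on sparknlp.start(), here is a minimal sketch for PySpark 2.4.x (swap the coordinate in spark.jars.packages for your Spark line; the Kryo settings follow the Spark NLP documentation):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4") \
    .getOrCreate()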
If you wish to use the sparknlp.start() function, you can add the following flags to automatically start a SparkSession with the correct Maven package:
import sparknlp
# for PySpark 3.0.x and 3.1.x
spark = sparknlp.start()
# for PySpark 2.4.x
spark = sparknlp.start(spark24=True)
# or for PySpark 2.3.x
spark = sparknlp.start(spark23=True)
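Once the session is up, a small sanity check (a sketch) that the versions line up before you build the pipeline:

import sparknlp

spark = sparknlp.start(spark24=True)  # matching PySpark 2.4.x here
print("Apache Spark version:", spark.version)    # e.g. 2.4.6
print("Spark NLP version:", sparknlp.version())  # e.g. 3.3.4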
I would also like to point out this discussion, which covers the MLReadable/MLWritable error:
Why do I see serialVersionUID or MLReadable or MLWritable errors
Full Disclosure: I am one of the Spark NLP maintainers.
I've had this issue when upgrading from Spark 2.4.5 to Spark 3+ (on Databricks, with Scala though). Try downgrading your Spark version.