
Unable to download the pretrained pipeline provided by the spark-nlp library

I am unable to use the predefined pipeline "recognize_entities_dl" provided by the spark-nlp library.

I tried installing different versions of pyspark and the spark-nlp library, but the error persists.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

#create or get Spark Session

spark = sparknlp.start()

sparknlp.version()
spark.version

#download, load, and annotate a text by pre-trained pipeline

pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
result = pipeline.annotate('Harry Potter is a great movie')

2.1.0
recognize_entities_dl download started this may take some time.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-b71a0f77e93a> in <module>
     11 #download, load, and annotate a text by pre-trained pipeline
     12 
---> 13 pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
     14 result = pipeline.annotate('Harry Potter is a great movie')

d:\python36\lib\site-packages\sparknlp\pretrained.py in __init__(self, name, lang, remote_loc)
     89 
     90     def __init__(self, name, lang='en', remote_loc=None):
---> 91         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     92         self.light_model = LightPipeline(self.model)
     93 

d:\python36\lib\site-packages\sparknlp\pretrained.py in downloadPipeline(name, language, remote_loc)
     50     def downloadPipeline(name, language, remote_loc=None):
     51         print(name + " download started this may take some time.")
---> 52         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     53         if file_size == "-1":
     54             print("Can not find the model to download please check the name!")

AttributeError: module 'sparknlp.internal' has no attribute '_GetResourceSize'
bhawana asked Oct 23 '19 12:10




1 Answer

Thanks for confirming your Apache Spark version. The pre-trained pipelines and models are built for specific Apache Spark and Spark NLP versions. You need Apache Spark 2.4.x or later to download the pre-trained models/pipelines. On any earlier version, you need to train your own models/pipelines.
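A quick way to check this before calling the downloader is to compare the running Spark version against the 2.4 minimum. This is only a minimal sketch: can_auto_download is a hypothetical helper name, not part of Spark NLP, and the (2, 4) threshold simply encodes the requirement stated above, which may differ for other Spark NLP releases.

```python
def can_auto_download(spark_version: str, minimum=(2, 4)) -> bool:
    """Return True if spark_version (e.g. the value of spark.version) meets the minimum.

    Hypothetical helper: the default (2, 4) reflects the requirement quoted in
    this answer, not a constant exposed by Spark NLP.
    """
    # Only major and minor matter here, so "2.4.4" becomes (2, 4).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= minimum


print(can_auto_download("2.4.4"))
print(can_auto_download("2.3.2"))
```

In a notebook you would pass spark.version from the session created by sparknlp.start().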

This is the list of all pipelines, and they are all built for Apache Spark 2.4.x: https://nlp.johnsnowlabs.com/docs/en/pipelines

If you look at the file name in the URL of any model or pipeline, you can see this information:

recognize_entities_dl_en_2.1.0_2.4_1562946909722.zip

  • Name: recognize_entities_dl
  • Lang: en
  • Spark NLP: 2.1.0 or greater
  • Apache Spark: 2.4.x or greater
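If you want to read these fields programmatically, the file name can be split with a regular expression. This is a sketch under assumptions: the pattern is inferred from the single example above, parse_model_filename and ModelFile are hypothetical names, and the trailing number is treated as an opaque build identifier because the answer does not say what it means.

```python
import re
from typing import NamedTuple


class ModelFile(NamedTuple):
    name: str
    lang: str
    spark_nlp_version: str
    spark_version: str
    build_id: str  # trailing number in the file name; meaning is assumed, not documented here


# Pattern inferred from the one example file name; other files may not follow it.
_PATTERN = re.compile(
    r"^(?P<name>.+)_(?P<lang>[a-z]{2,3})"
    r"_(?P<nlp>\d+(?:\.\d+)*)_(?P<spark>\d+(?:\.\d+)*)_(?P<build>\d+)\.zip$"
)


def parse_model_filename(filename: str) -> ModelFile:
    """Split a model/pipeline zip name into its underscore-separated fields."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected file name format: " + filename)
    return ModelFile(
        match.group("name"),
        match.group("lang"),
        match.group("nlp"),
        match.group("spark"),
        match.group("build"),
    )


info = parse_model_filename("recognize_entities_dl_en_2.1.0_2.4_1562946909722.zip")
print(info.name, info.lang, info.spark_nlp_version, info.spark_version)
```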

NOTE: The Spark NLP library is built and compiled against Apache Spark 2.4.x. That is why models and pipelines are only available for the 2.4.x version.

NOTE 2: Since you are using Windows, you need to use the _noncontrib models and pipelines, which are compatible with Windows. See the related question "Do Spark-NLP pretrained pipelines only work on linux systems?"

I hope this answer helps and solves your issue.

UPDATE April 2020: Apparently, the models and pipelines trained and uploaded on Apache Spark 2.4.x are compatible with Apache Spark 2.3.x as well. So if you are on Apache Spark 2.3.x, you cannot use pretrained() for auto-download, but you can download the model or pipeline manually and use .load() instead.

Full list of all models and pipelines with links to download: https://github.com/JohnSnowLabs/spark-nlp-models
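As a rough sketch of the manual route for a pipeline: download the zip, extract it, and load the extracted folder with PipelineModel.load from pyspark.ml. The helper name load_local_pipeline and the example path are hypothetical, and the os.path.isdir check assumes the folder sits on the local file system rather than on HDFS or another store.

```python
import os


def load_local_pipeline(path: str):
    """Load an extracted Spark NLP pipeline folder and wrap it in a LightPipeline.

    Hypothetical helper; assumes `path` is a local directory produced by
    unzipping a downloaded pipeline archive.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError("extracted pipeline folder not found: " + path)

    # Imported lazily so the path check above works even without Spark installed.
    import sparknlp
    from pyspark.ml import PipelineModel
    from sparknlp.base import LightPipeline

    sparknlp.start()  # create or get the Spark session before loading
    model = PipelineModel.load(path)
    return LightPipeline(model)


# Usage (the path is only an example):
# pipeline = load_local_pipeline("/models/recognize_entities_dl_en_2.1.0_2.4_1562946909722")
# result = pipeline.annotate("Harry Potter is a great movie")
```

Wrapping the loaded model in LightPipeline mirrors what PretrainedPipeline does in the traceback from the question, so annotate() can be called the same way.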

Update: Since the Spark NLP 2.4.0 release, all models and pipelines are cross-platform, so there is no need to choose a different model/pipeline for a specific OS: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.4.0

For newer releases: https://github.com/JohnSnowLabs/spark-nlp/releases

Maziyar answered Oct 19 '22 05:10