
How should we use the setDictionary for the lemmatization annotator in Spark-NLP?

I have a requirement to add a dictionary in the lemmatization step. When I use it in a pipeline and call pipeline.fit(), I get an ArrayIndexOutOfBoundsException. What is the correct way to implement this? Are there any examples?

I am passing token as the input column for the lemmatizer and lemma as the output column. My code is as follows:

    import com.johnsnowlabs.nlp.util.io.ExternalResource

    // DocumentAssembler annotator
    val document = new DocumentAssembler()
    // SentenceDetector annotator
    val sentenceDetector = new SentenceDetector()
    // Tokenizer annotator
    val token = new Tokenizer()
    // Lemmatizer annotator
    val lemmatizer = new Lemmatizer()
    val pipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, lemmatizer))
    val result = pipeline.fit(df).transform(df)

The error message is:

    Name: java.lang.ArrayIndexOutOfBoundsException
    Message: 1
    StackTrace:   at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:315)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:312)
      at scala.collection.Iterator$class.foreach(Iterator.scala:891)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$.flattenRevertValuesAsKeys(ResourceHelper.scala:312)
      at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:52)
      at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:19)
      at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:45)
      at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
      at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
      at scala.collection.Iterator$class.foreach(Iterator.scala:891)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
      at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
      at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
      at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)

StuckProgrammer asked Sep 10 '19 11:09


1 Answer

Your pipeline looks good to me, so everything depends on what is inside lemmas001.txt and whether you are able to access it on Windows.

NOTE: I have seen users on Windows using this inside Apache Spark:


Training a Lemmatizer in Spark NLP is simple:

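    // Input and output column names follow the ones mentioned in the question
    val lemmatizer = new Lemmatizer()
        .setInputCols(Array("token"))
        .setOutputCol("lemma")
        .setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")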

The file must have the following format where the keyDelimiter in this case is -> and the valueDelimiter is \t:

    abnormal    ->  abnormal    abnormals
    abode   ->  abode   abodes
    abolish ->  abolishing  abolished   abolish abolishes
    abolitionist    ->  abolitionist    abolitionists
    abominate   ->  abominate   abominated  abominates
    abomination ->  abomination abominations
    aboriginal  ->  aboriginal  aboriginals
    aborigine   ->  aborigines  aborigine
    abort   ->  aborted abort   aborts  aborting
    abortifacient   ->  abortifacients  abortifacient
    abortionist ->  abortionist abortionists
    abortion    ->  abortion    abortions
    abo ->  abo abos
    abotrite    ->  abotrites   abotrite
    abound  ->  abound  abounds abounding   abounded
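
Since the stack trace in the question ends inside ResourceHelper.flattenRevertValuesAsKeys with the message 1, one plausible cause (an inference from the trace, not something confirmed here) is a line that does not split into a key and a values part with the configured delimiters. Below is a minimal sketch in plain Scala for checking a dictionary before passing it to setDictionary. The object LemmaDictionaryCheck and its methods are hypothetical helpers written for this answer, not part of Spark NLP, and the sketch assumes the whitespace in the sample above stands for tab characters.

```scala
// Hypothetical helper for illustration only; this is NOT Spark NLP code.
object LemmaDictionaryCheck {

  // Splits one line into (lemma, forms). Returns None when the key delimiter
  // is missing, the lemma is empty, or no forms follow the delimiter.
  def parseLine(line: String, keyDelimiter: String, valueDelimiter: String): Option[(String, Seq[String])] = {
    val pos = line.indexOf(keyDelimiter)
    if (pos < 0) None
    else {
      val lemma = line.substring(0, pos).trim
      val forms = line.substring(pos + keyDelimiter.length)
        .split(java.util.regex.Pattern.quote(valueDelimiter)) // treat the delimiter literally, not as a regex
        .map(_.trim)
        .filter(_.nonEmpty)
        .toSeq
      if (lemma.isEmpty || forms.isEmpty) None else Some((lemma, forms))
    }
  }

  // 1-based numbers of the lines that fail to parse.
  // Blank lines are skipped here; that choice is an assumption of this sketch.
  def malformedLines(lines: Seq[String], keyDelimiter: String, valueDelimiter: String): Seq[Int] =
    lines.zipWithIndex.collect {
      case (line, idx) if line.trim.nonEmpty && parseLine(line, keyDelimiter, valueDelimiter).isEmpty => idx + 1
    }

  // Inverts the well-formed lines into a form -> lemma lookup.
  def formToLemma(lines: Seq[String], keyDelimiter: String, valueDelimiter: String): Map[String, String] =
    lines.flatMap { line =>
      parseLine(line, keyDelimiter, valueDelimiter).toSeq.flatMap {
        case (lemma, forms) => forms.map(_ -> lemma)
      }
    }.toMap
}
```

To check a real file, you could read it with scala.io.Source.fromFile(path).getLines().toSeq and pass the result to malformedLines together with the same delimiters you give to setDictionary. Any line number it reports is worth inspecting by hand, for example for spaces where tabs were expected.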

Also, if you don't want to train your own Lemmatizer, you can use the pre-trained models as follows:


    // English
    val lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")

    // French
    val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "fr")

    // Italian
    val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "it")

    // German
    val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "de")

List of all pre-trained models is here: https://nlp.johnsnowlabs.com/docs/en/models

List of all pre-trained pipelines is here: https://nlp.johnsnowlabs.com/docs/en/pipelines

Please let me know in the comments if you have more questions.

Full disclosure: I am one of the contributors to the Spark NLP library.

Update: I found this example for you in Scala on Databricks, in case you are interested (this is actually how the Italian Lemmatizer model was trained).

Maziyar answered Nov 09 '22 14:11
