I have a requirement where I have to add a dictionary in the lemmatization step. When I try to use it in a pipeline and call pipeline.fit(), I get an ArrayIndexOutOfBoundsException. What is the correct way to implement this? Are there any examples?
I am passing token as the input column for lemmatization and lemma as the output column. Here is my code:
// DocumentAssembler annotator
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// SentenceDetector annotator
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
// Tokenizer annotator
val token = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
import com.johnsnowlabs.nlp.util.io.ExternalResource
// lemmatizer annotator
val lemmatizer = new Lemmatizer()
.setInputCols(Array("token"))
.setOutputCol("lemma")
.setDictionary(ExternalResource("C:/data/notebook/lemmas001.txt", "LINE_BY_LINE", Map("keyDelimiter" -> ",", "valueDelimiter" -> "|")))
val pipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, lemmatizer))
val result = pipeline.fit(df).transform(df)
The error message is:
Name: java.lang.ArrayIndexOutOfBoundsException
Message: 1
StackTrace: at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:315)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$.flattenRevertValuesAsKeys(ResourceHelper.scala:312)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:52)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:19)
at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:45)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
There are two types of annotators:
- Approach: an AnnotatorApproach extends Estimator and is meant to be trained through fit()
- Model: an AnnotatorModel extends Transformer and is meant to transform DataFrames through transform()
Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. It provides an easy API to integrate with ML Pipelines and it is commercially supported by John Snow Labs.
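To make the Approach/Model distinction concrete, here is a minimal sketch (reusing the document, sentenceDetector, token and df values from your question, and the AntBNC dictionary shown further down): the dictionary-based Lemmatizer is an Approach, so the dictionary is only read when fit() runs, and the fitted pipeline then holds a LemmatizerModel that does the actual transformation.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.annotators.{Lemmatizer, LemmatizerModel}
// Approach: an Estimator, trained through fit() -- this is where the dictionary is parsed
val lemmaApproach = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")
val fitted = new Pipeline()
  .setStages(Array(document, sentenceDetector, token, lemmaApproach))
  .fit(df)
// Model: a Transformer -- the fitted lemmatizer stage is now a LemmatizerModel
val lemmaModel = fitted.stages.last.asInstanceOf[LemmatizerModel]
// transform() only applies the already-trained stages; no further training happens here
val result = fitted.transform(df)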
Your pipeline looks good to me, so everything depends on what is inside lemmas001.txt and whether you are able to access it on Windows.
NOTE: I have seen users on Windows use paths like this inside Apache Spark:
"C:\\Users\\something\\Desktop\\someDirectory\\somefile.txt"
Training a Lemmatizer in Spark NLP is simple:
val lemmatizer = new Lemmatizer()
.setInputCols(Array("token"))
.setOutputCol("lemma")
.setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")
The file must have the following format, where the keyDelimiter in this case is -> and the valueDelimiter is \t:
abnormal -> abnormal abnormals
abode -> abode abodes
abolish -> abolishing abolished abolish abolishes
abolitionist -> abolitionist abolitionists
abominate -> abominate abominated abominates
abomination -> abomination abominations
aboriginal -> aboriginal aboriginals
aborigine -> aborigines aborigine
abort -> aborted abort aborts aborting
abortifacient -> abortifacients abortifacient
abortionist -> abortionist abortionists
abortion -> abortion abortions
abo -> abo abos
abotrite -> abotrites abotrite
abound -> abound abounds abounding abounded
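Applied to your code: with keyDelimiter "," and valueDelimiter "|", every line of lemmas001.txt would have to look like abolish,abolishing|abolished|abolish|abolishes. If your file is actually in the -> / tab format shown above, the simpler setDictionary overload would work instead (a sketch using your path, assuming that format):
val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  // keyDelimiter "->" and valueDelimiter "\t" must match what is really inside the file
  .setDictionary("C:/data/notebook/lemmas001.txt", "->", "\t")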
Also, if you don't want to train your own Lemmatizer, you can use the pre-trained models as follows:
English
val lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")
.setInputCols(Array("token"))
.setOutputCol("lemma")
French
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "fr")
.setInputCols(Array("token"))
.setOutputCol("lemma")
Italian
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "it")
.setInputCols(Array("token"))
.setOutputCol("lemma")
German
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "de")
.setInputCols(Array("token"))
.setOutputCol("lemma")
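For completeness, a pretrained LemmatizerModel drops straight into the same pipeline as the trainable one (a sketch reusing the stages and df from your question); since it is already a Model, fit() does not retrain it:
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.annotators.LemmatizerModel
val pretrainedLemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
val pretrainedPipeline = new Pipeline()
  .setStages(Array(document, sentenceDetector, token, pretrainedLemmatizer))
// fit() only wires the stages together here; the lemmatizer is already trained
val lemmatized = pretrainedPipeline.fit(df).transform(df)
lemmatized.selectExpr("lemma.result").show(5, false)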
The list of all pre-trained models is here: https://nlp.johnsnowlabs.com/docs/en/models
The list of all pre-trained pipelines is here: https://nlp.johnsnowlabs.com/docs/en/pipelines
Please let me know in the comments if you have more questions.
Full disclosure: I am one of the contributors of Spark NLP library.
Update: I found this example for you in Scala on Databricks, in case you are interested (this is actually how the Italian Lemmatizer model was trained).