 

Arabic lemmatization and Stanford NLP

I am trying to do lemmatization, i.e., identify the lemma and possibly the Arabic root of a verb, for example: يتصل ==> lemma (infinitive of the verb) ==> اتصل ==> root (triliteral root / Jidr thoulathi) ==> و ص ل

Do you think Stanford NLP can do that?

Best Regards,

Riadh Belkebir asked Mar 19 '15 17:03

1 Answer

The Stanford Arabic segmenter can't do true lemmatization. However, it is possible to train a new model to do something like stemming:

  • تكتبون ← ت+ كتب +ون
  • يتصل ← ي+ تصل
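
For reference, here is a minimal sketch of how the downloadable segmenter is typically run on a file of raw Arabic text. The class name is the one documented for the segmenter; the jar name, model path, and input file below are placeholder assumptions to be replaced with your local paths:

    import subprocess

    # Invoke the downloadable Stanford Arabic segmenter on raw Arabic text.
    # Jar name, model path, and file names are placeholders (assumptions).
    cmd = [
        "java", "-mx1g",
        "-cp", "stanford-segmenter.jar",
        "edu.stanford.nlp.international.arabic.process.ArabicSegmenter",
        "-loadClassifier", "data/arabic-segmenter-atbtrain.ser.gz",
        "-textFile", "input_arabic.txt",
    ]
    with open("input_arabic.segmented.txt", "w", encoding="utf-8") as out:
        subprocess.run(cmd, stdout=out, check=True)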

If it is very important that the output consist of real Arabic lemmas ("تصل" is not a true lemma), you might be better off with a tool like MADAMIRA (http://nlp.ldeo.columbia.edu/madamira/).

Elaboration: The Stanford Arabic segmenter produces its output character-by-character using only these operations (implemented in edu.stanford.nlp.international.arabic.process.IOBUtils):

  • Split a word between two characters
  • Transform lil- (للـ) into li+ al- (ل+ الـ)
  • Transform ta (ت) or ha (ه) into ta marbuta (ة)
  • Transform ya (ي) or alif (ا) into alif maqsura (ى)
  • Transform alif maqsura (ى) into ya (ي)

So lemmatizing يتصل to ي+ اتصل would require implementing an extra rule, namely inserting an alif after the ya or ta. Lemmatization of certain irregular forms would be completely impossible (for example, نساء ← امرأة).

The version of the Stanford segmenter available for download also only breaks off pronouns and particles:

وسيكتشفونه ← و+ س+ يكتشفون +ه

However, if you have access to the LDC Arabic Treebank or a similarly rich source of Arabic text with morphological segmentation annotated, it is possible to train your own model to remove all morphological affixes, which is closer to lemmatization:

وسيكتشفونه ← و+ س+ ي+ كتشف +ون +ه

Note that "كتشف" is not a real Arabic word, but the segmenter should at least consistently produce "كتشف" for تكتشفين ,أكتشف ,يكتشف, etc. If this is acceptable, you would need to change the ATB preprocessing script to instead use the morphological segmentation annotations. You could do this by replacing the script called parse_integrated with a modified version like this: https://gist.github.com/futurulus/38307d98992e7fdeec0d

Then follow the instructions for "TRAINING THE SEGMENTER" in the README.
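
As a rough sketch of that training step, the command below follows the usual Stanford classifier conventions (-trainFile, -serializeTo); the exact option names and all file paths here are assumptions, so defer to the README for the authoritative invocation:

    import subprocess

    # Train a new segmenter model on ATB data preprocessed to keep the
    # morphological segmentation annotations. Flag names follow common
    # Stanford classifier conventions; file names are placeholders.
    train_cmd = [
        "java", "-mx4g",
        "-cp", "stanford-segmenter.jar",
        "edu.stanford.nlp.international.arabic.process.ArabicSegmenter",
        "-trainFile", "atb_morph_segmented.txt",
        "-serializeTo", "arabic-morph-segmenter.ser.gz",
    ]
    subprocess.run(train_cmd, check=True)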

futurulus answered Sep 28 '22 16:09