 

Building a lemmatizer: speed optimization

I am building a lemmatizer in Python. It needs to run in real time and process a fairly large amount of data, so processing speed is of the essence.

Data: I have all possible suffixes, each linked to the wordtypes it can be combined with. Additionally, I have lemmaforms linked to both their wordtype(s) and lemma(s). The program takes a word as input and outputs its lemma: word = lemmaform + suffix.

For example (Note: although the example is given in English I am not building a lemmatizer for English):

word: forbidding

lemmaform: forbidd

suffix: ing

lemma: forbid

My solution:

I have converted the data to (nested) dicts:

suffixdict    : {suffix1: [type1, type2, ..., type(n)],
                 suffix2: [type1, type2, ..., type(n)]}
lemmaformdict : {lemmaform: {type1: lemma}}
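
With the forbidding example above, the data might be populated like this (the wordtype names are made up purely for illustration):

    suffixdict = {
        'ing': ['verb'],              # hypothetical wordtypes
        's':   ['noun', 'verb'],
    }
    lemmaformdict = {
        'forbidd': {'verb': 'forbid'},
    }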

1) Find all possible suffixes and the wordtypes they are linked to. If the longest possible suffix is 3 characters long, the program tries to match 'ing', 'ng' and 'g' against the keys in suffixdict. For every key that exists, it returns the value (a set of wordtypes).

2) For each matching suffix, look up the remaining lemmaform in lemmaformdict. If the lemmaform exists, it returns its wordtypes.

3) Finally, the program intersects the wordtypes produced in steps 1) and 2), and if the intersection is non-empty it returns the lemma of the word (see the sketch below).
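
A rough sketch of those three steps (the function and its structure are mine, not taken from any library):

    def lemmatize(word, suffixdict, lemmaformdict, max_suffix_len=3):
        """Return the set of possible lemmas for `word` (steps 1-3 above)."""
        lemmas = set()
        # 1) try every suffix length, longest first
        for n in range(min(max_suffix_len, len(word) - 1), 0, -1):
            suffix = word[-n:]
            suffix_types = suffixdict.get(suffix)
            if not suffix_types:
                continue
            # 2) look up the remaining part of the word as a lemmaform
            entry = lemmaformdict.get(word[:-n])
            if not entry:
                continue
            # 3) intersect the wordtypes from 1) and 2) and collect the lemmas
            for wordtype in set(suffix_types) & set(entry):
                lemmas.add(entry[wordtype])
        return lemmas

With the example dicts above, lemmatize('forbidding', suffixdict, lemmaformdict) returns {'forbid'}.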

My question: could there be a better solution to my problem from the perspective of speed (disregarding the option of keeping frequent words and lemmas in a dictionary)? Help much appreciated.

asked Mar 23 '12 by root


People also ask

How do you implement Lemmatization?

In order to lemmatize, you need to create an instance of WordNetLemmatizer() and call its lemmatize() function on a single word. Let's lemmatize a simple sentence: first tokenize the sentence into words using nltk.word_tokenize, and then call lemmatizer.lemmatize() on each token.
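
A minimal example of that approach, assuming NLTK and its WordNet data are installed:

    import nltk
    from nltk.stem import WordNetLemmatizer

    # one-time downloads, uncomment on first run
    # nltk.download('punkt')
    # nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize("The bats were hanging on their feet")
    print([lemmatizer.lemmatize(w) for w in words])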

What is the best Lemmatizer?

WordNet is a publicly available lexical database, covering over 200 languages, that provides semantic relationships between its words. It is one of the earliest and most commonly used lemmatization techniques.

What is Lemmatization give an example?

In lemmatization, the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. For example, runs, running and ran are all forms of the word run; therefore run is the lemma of all these words.


2 Answers

This would be a wonderful application for finite state transducers. Why? Because they allow you to do string rewriting efficiently (in time linear in the size of the input). Consider the following sample transducer:

[Image: sample finite-state transducer]

It takes a string as input and checks whether there exists a path from the initial state (here, 0) to a final state (10, 12 and 17, respectively) given the sequence of input characters. If it reaches a final state, it produces the appropriate output, e.g. (forbidd, ing) if the input was "forbidding".

I don't know whether you have any background on finite state automata, though. If not, give them a try - it will be worth the effort. :) Tries are a special kind of finite state automaton (the sample transducer above is a trie), so they might be a good start.
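
A toy illustration of the idea in Python, using a plain dict-of-dicts trie in place of a real FST library (all names are mine, and the entries cover only the forbid example):

    def build_transducer(entries):
        """Trie over full words; final states carry the (lemmaform, suffix) output."""
        root = {}
        for lemmaform, suffix in entries:
            node = root
            for ch in lemmaform + suffix:
                node = node.setdefault(ch, {})
            node['#output'] = (lemmaform, suffix)   # final state
        return root

    def transduce(trie, word):
        """Follow the path for `word`; return its output if it ends in a final state."""
        node = trie
        for ch in word:
            if ch not in node:
                return None                         # no path: word not recognized
            node = node[ch]
        return node.get('#output')

    trie = build_transducer([('forbidd', 'ing'), ('forbid', '')])
    print(transduce(trie, 'forbidding'))            # ('forbidd', 'ing')

Each input character follows exactly one edge, so the lookup is linear in the length of the word, which is the property the answer is pointing at.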

answered Sep 22 '22 by jena


Consider using a non-deterministic trie automaton covering the full set of recognized suffixes, but analyzing the word backwards. Being non-deterministic means the machine can be in multiple states at once, and the machine as a whole is in an accepting state if any of those states are accepting.

The initial state would be an accept state so it could recognize no suffix (as in English be). From the initial state, transitions (), ('e', 'z', 'i'), ('e', 'd', 'a') and ('e', 'v', 'o') for example would all arrive at accepting states, and you don't have to worry about the conflicting 'e's when using an NFA.

From the initial state the "characters" of each word are fed in backwards. Each time the machine lands in an accept state, the remaining part of the word is looked up in your lemmaformdict and all the results are kept. Then processing continues until the machine's state is null (not merely non-accepting).

At that point the total choices of lemmas indicated along the way lead to the possible interpretations of that word removed from context (and that should always be a small number).

Exactly how you implement the NFA will determine the performance. An NFA can be converted into a DFA once constructed, so that the machine is in only one state at any given time; that might help performance without complicating construction of the machine. On the downside, you have to work with the input at the individual-character level, which in Python may cost you in performance. (But if performance is that precious, maybe you should switch to C++.)
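
A rough sketch of that backward scan, using a plain dict trie over reversed suffixes (the dict layouts follow the question; the trie itself is deterministic, and the multi-state behaviour shows up as accepting states being passed at several depths):

    def build_suffix_trie(suffixdict):
        """Trie over reversed suffixes; accepting nodes remember which suffix ends there."""
        root = {'#suffix': ''}                     # initial state accepts the empty suffix
        for suffix in suffixdict:
            node = root
            for ch in reversed(suffix):
                node = node.setdefault(ch, {})
            node['#suffix'] = suffix               # accepting state
        return root

    def lemmatize_backward(word, trie, suffixdict, lemmaformdict):
        lemmas = set()
        node, consumed = trie, 0
        while node is not None:
            if '#suffix' in node:                  # accepting state: candidate split
                suffix = node['#suffix']
                entry = lemmaformdict.get(word[:len(word) - consumed], {})
                types = set(suffixdict.get(suffix, [])) if suffix else set(entry)
                for wordtype in types & set(entry):
                    lemmas.add(entry[wordtype])
            if consumed == len(word):
                break
            node = node.get(word[len(word) - 1 - consumed])  # next character, moving backwards
            consumed += 1
        return lemmas

    suffixdict = {'ing': ['verb'], 's': ['noun', 'verb']}
    lemmaformdict = {'forbidd': {'verb': 'forbid'}, 'be': {'verb': 'be'}}
    trie = build_suffix_trie(suffixdict)
    print(lemmatize_backward('forbidding', trie, suffixdict, lemmaformdict))  # {'forbid'}
    print(lemmatize_backward('be', trie, suffixdict, lemmaformdict))          # {'be'}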

answered Sep 22 '22 by wberry