I am building a lemmatizer in Python. Since it needs to run in real time and process fairly large amounts of data, processing speed is of the essence.

Data: I have all possible suffixes, each linked to all the wordtypes it can combine with. Additionally, I have lemmaforms that are linked to both their wordtype(s) and lemma(s). The program takes a word as input and outputs its lemma: word = lemmaform + suffix
For example (note: although the example is given in English, I am not building a lemmatizer for English):
word: forbidding
lemmaform: forbidd
suffix: ing
lemma: forbid
My solution:
I have converted the data to (nested) dicts:
suffixdict: {suffix1: [type1, type2, ..., type(n)], suffix2: [type1, type2, ..., type(n)]}
lemmaformdict: {lemmaform: {type1: lemma}}
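Using the "forbidding" example, a toy version of these dicts might look like the following. The wordtype labels "verb" and "noun" are made-up placeholders, and I use sets rather than lists so that the intersection in step 3 below is a single set operation:

```python
# Toy data for the "forbidding" example; "verb"/"noun" are placeholder
# wordtype labels, and sets are used so intersection is cheap.
suffixdict = {
    "ing": {"verb"},            # suffix -> wordtypes it can combine with
    "s": {"verb", "noun"},
}
lemmaformdict = {
    "forbidd": {"verb": "forbid"},   # lemmaform -> {wordtype: lemma}
}
```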
1) Find all possible suffixes and the wordtypes they are linked to. If the longest possible suffix is 3 characters long, the program tries to match 'ing', 'ng' and 'g' against the keys of suffixdict. If the key exists, it returns a value (a set of wordtypes).
2) For each matching suffix, look up the remaining lemmaform in lemmaformdict. If the lemmaform exists, it returns its wordtypes.
3) Finally, the program intersects the wordtypes produced in steps 1) and 2), and if the intersection is non-empty it returns the lemma of the word (see the sketch below).
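A minimal sketch of these three steps, using the toy dicts above and assuming the longest suffix is 3 characters (MAX_SUFFIX_LEN is my name, not part of the original data):

```python
MAX_SUFFIX_LEN = 3  # assumed length of the longest suffix in suffixdict

def lemmatize(word, suffixdict, lemmaformdict):
    """Return the set of candidate lemmas for word, following steps 1)-3)."""
    lemmas = set()
    for n in range(min(MAX_SUFFIX_LEN, len(word) - 1), 0, -1):
        # Step 1: do the last n characters form a known suffix?
        suffix_types = suffixdict.get(word[-n:])
        if suffix_types is None:
            continue
        # Step 2: is the remainder a known lemmaform?
        entry = lemmaformdict.get(word[:-n])
        if entry is None:
            continue
        # Step 3: intersect the wordtypes from steps 1 and 2.
        for wordtype in suffix_types & entry.keys():
            lemmas.add(entry[wordtype])
    return lemmas

print(lemmatize("forbidding", suffixdict, lemmaformdict))  # {'forbid'}
```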
My question: could there be a better solution to my problem from the perspective of speed? (Disregarding the option of keeping frequent words and lemmas in a dictionary.) Help much appreciated.
This would be a wonderful application for finite state transducers. Why? Because they allow you to do string rewriting efficiently (in time linear in the size of the input). Consider the following sample transducer: [transducer diagram not reproduced]
It takes a string as input and checks whether there exists a path from the initial state (here, 0) to a final state (10, 12 and 17, respectively) given the sequence of input characters. If it reaches a final state, it produces the appropriate output, e.g. (forbidd, ing) if the input was "forbidding".
I don't know whether you have any background in finite state automata, though. If not, give them a try - it will be worth the effort. :) Tries are a special kind of finite state automaton (the sample transducer above is a trie), so they might be a good start.
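If tries sound promising, here is a minimal dict-based one to start experimenting with; mapping a full word to its (lemmaform, suffix) split mimics what the transducer above outputs. The _END marker and function names are arbitrary choices of mine:

```python
_END = "_end"  # arbitrary terminal marker

def trie_insert(trie, word, value):
    """Insert word into a nested-dict trie, storing value at its end."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[_END] = value

def trie_lookup(trie, word):
    """Return the value stored for word, or None if it is absent."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(_END)

trie = {}
trie_insert(trie, "forbidding", ("forbidd", "ing"))
print(trie_lookup(trie, "forbidding"))  # ('forbidd', 'ing')
```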
Consider using a non-deterministic trie automaton covering the full set of recognized suffixes, but analyzing the word backwards. Being non-deterministic means the machine can be in multiple states at once, and the machine as a whole is in an accepting state if any of those states are accepting.
The initial state would be an accept state, so that it can recognize a word with no suffix (as in English "be"). From the initial state, transitions ('e', 'z', 'i'), ('e', 'd', 'a') and ('e', 'v', 'o'), for example, would all arrive at accepting states, and you don't have to worry about the conflicting 'e's when using an NFA.
From the initial state, the "characters" of each word are fed in backwards. Each time the machine lands in an accept state, the remaining part of the word is looked up in your lemmaformdict and all the results are kept. Then processing continues until the machine's state is null (not merely non-accepting).
At that point, the lemma choices collected along the way give the possible interpretations of the word out of context (and that should always be a small number).
Exactly how you implement the NFA will determine the performance. NFAs can be converted into DFAs once constructed, so that the machine has only one state at any given time; that might help performance without complicating construction of the machine. On the downside, you have to work with the input at the individual-character level, which may cost you performance in Python. (But if performance is that precious, maybe you should switch to C++.)
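For illustration, here is a sketch of the backward walk in Python. It uses a plain trie over reversed suffixes, which for this machine behaves like the DFA you would get from the NFA conversion; build_suffix_trie, _ACCEPT and the assumption that suffixdict values are sets are all mine:

```python
_ACCEPT = "_accept"  # marks a state that completes a suffix

def build_suffix_trie(suffixdict):
    """Build a trie over reversed suffixes. An entry for "" would make
    the initial state accepting (the no-suffix case, like "be")."""
    trie = {}
    for suffix, wordtypes in suffixdict.items():
        node = trie
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[_ACCEPT] = wordtypes
    return trie

def lemmatize_backward(word, trie, lemmaformdict):
    """Feed word in backwards; on each accept state, look up the rest."""
    lemmas = set()
    node, i = trie, len(word)
    while node is not None:
        wordtypes = node.get(_ACCEPT)
        if wordtypes is not None:              # accept state: suffix word[i:]
            entry = lemmaformdict.get(word[:i], {})
            for wt in wordtypes & entry.keys():
                lemmas.add(entry[wt])
        if i == 0:
            break                              # whole word consumed
        i -= 1
        node = node.get(word[i])               # None -> state is null, stop
    return lemmas

trie = build_suffix_trie(suffixdict)           # toy dicts from the question
print(lemmatize_backward("forbidding", trie, lemmaformdict))  # {'forbid'}
```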