I am building a lemmatizer in Python. Since it needs to run in real time and process fairly large amounts of data, processing speed is of the essence.

Data: I have all possible suffixes, each linked to all the wordtypes it can combine with. Additionally, I have lemmaforms that are linked to both their wordtype(s) and lemma(s). The program takes a word as input and outputs its lemma: word = lemmaform + suffix
For example (note: although the example is given in English, I am not building a lemmatizer for English):
word: forbidding
lemmaform: forbidd
suffix: ing
lemma: forbid
My solution:
I have converted the data to (nested) dicts:
suffixdict: {suffix1: [type1, type2, ..., type(n)], suffix2: [type1, type2, ..., type(n)]}
lemmaformdict: {lemmaform: {type1: lemma}}
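Using the "forbidding" example, a toy version of these dicts might look like the following. The wordtype labels "verb" and "noun" are made-up placeholders, and I use sets rather than lists so that the intersection in step 3 below is a single set operation:

```python
# Toy data for the "forbidding" example; "verb"/"noun" are placeholder
# wordtype labels, and sets are used so intersection is cheap.
suffixdict = {
    "ing": {"verb"},            # suffix -> wordtypes it can combine with
    "s": {"verb", "noun"},
}
lemmaformdict = {
    "forbidd": {"verb": "forbid"},   # lemmaform -> {wordtype: lemma}
}
```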
1) Find all possible suffixes and the wordtypes they are linked to. If the longest possible suffix is 3 characters long, the program tries to match 'ing', 'ng' and 'g' against the keys of suffixdict. If the key exists, it returns a value (a set of wordtypes).
2) For each matching suffix, look up the remaining lemmaform in lemmaformdict. If the lemmaform exists, it returns its wordtypes.
3) Finally, the program intersects the wordtypes produced in steps 1) and 2), and if the intersection is non-empty it returns the lemma of the word (see the sketch below).
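A minimal sketch of these three steps, using the toy dicts above and assuming the longest suffix is 3 characters (MAX_SUFFIX_LEN is my name, not part of the original data):

```python
MAX_SUFFIX_LEN = 3  # assumed length of the longest suffix in suffixdict

def lemmatize(word, suffixdict, lemmaformdict):
    """Return the set of candidate lemmas for word, following steps 1)-3)."""
    lemmas = set()
    for n in range(min(MAX_SUFFIX_LEN, len(word) - 1), 0, -1):
        # Step 1: do the last n characters form a known suffix?
        suffix_types = suffixdict.get(word[-n:])
        if suffix_types is None:
            continue
        # Step 2: is the remainder a known lemmaform?
        entry = lemmaformdict.get(word[:-n])
        if entry is None:
            continue
        # Step 3: intersect the wordtypes from steps 1 and 2.
        for wordtype in suffix_types & entry.keys():
            lemmas.add(entry[wordtype])
    return lemmas

print(lemmatize("forbidding", suffixdict, lemmaformdict))  # {'forbid'}
```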
My question: could there be a better solution to my problem from the perspective of speed? (Disregarding the option of keeping frequent words and lemmas in a dictionary.) Help much appreciated.
This would be a wonderful application for finite state transducers. Why? Because they allow you to do string rewriting efficiently (in time linear in the size of the input). Consider the following sample transducer: [transducer diagram not reproduced]
It takes a string as input and checks whether there exists a path from the initial state (here, 0) to a final state (10, 12 and 17, respectively) given the sequence of input characters. If it reaches a final state, it produces the appropriate output, e.g. (forbidd, ing) if the input was "forbidding".
I don't know whether you have any background in finite state automata, though. If not, give them a try - it will be worth the effort. :) Tries are a special kind of finite state automaton (the sample transducer above is a trie), so they might be a good start.
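If tries sound promising, here is a minimal dict-based one to start experimenting with; mapping a full word to its (lemmaform, suffix) split mimics what the transducer above outputs. The _END marker and function names are arbitrary choices of mine:

```python
_END = "_end"  # arbitrary terminal marker

def trie_insert(trie, word, value):
    """Insert word into a nested-dict trie, storing value at its end."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[_END] = value

def trie_lookup(trie, word):
    """Return the value stored for word, or None if it is absent."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(_END)

trie = {}
trie_insert(trie, "forbidding", ("forbidd", "ing"))
print(trie_lookup(trie, "forbidding"))  # ('forbidd', 'ing')
```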
Consider using a non-deterministic trie automaton covering the full set of recognized suffixes, but analyzing the word backwards. Being non-deterministic means the machine can be in multiple states at once, and the machine as a whole is in an accepting state if any of those states are accepting.
The initial state would be an accept state, so that it can recognize a word with no suffix (as in English "be"). From the initial state, transitions ('e', 'z', 'i'), ('e', 'd', 'a') and ('e', 'v', 'o'), for example, would all arrive at accepting states, and you don't have to worry about the conflicting 'e's when using an NFA.
From the initial state, the "characters" of each word are fed in backwards. Each time the machine lands in an accept state, the remaining part of the word is looked up in your lemmaformdict and all the results are kept. Then processing continues until the machine's state is null (not merely non-accepting).
At that point, the lemma choices collected along the way give the possible interpretations of the word out of context (and that should always be a small number).
Exactly how you implement the NFA will determine the performance. NFAs can be converted into DFAs once constructed, so that the machine has only one state at any given time; that might help performance without complicating construction of the machine. On the downside, you have to work with the input at the individual-character level, which may cost you performance in Python. (But if performance is that precious, maybe you should switch to C++.)
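For illustration, here is a sketch of the backward walk in Python. It uses a plain trie over reversed suffixes, which for this machine behaves like the DFA you would get from the NFA conversion; build_suffix_trie, _ACCEPT and the assumption that suffixdict values are sets are all mine:

```python
_ACCEPT = "_accept"  # marks a state that completes a suffix

def build_suffix_trie(suffixdict):
    """Build a trie over reversed suffixes. An entry for "" would make
    the initial state accepting (the no-suffix case, like "be")."""
    trie = {}
    for suffix, wordtypes in suffixdict.items():
        node = trie
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[_ACCEPT] = wordtypes
    return trie

def lemmatize_backward(word, trie, lemmaformdict):
    """Feed word in backwards; on each accept state, look up the rest."""
    lemmas = set()
    node, i = trie, len(word)
    while node is not None:
        wordtypes = node.get(_ACCEPT)
        if wordtypes is not None:              # accept state: suffix word[i:]
            entry = lemmaformdict.get(word[:i], {})
            for wt in wordtypes & entry.keys():
                lemmas.add(entry[wt])
        if i == 0:
            break                              # whole word consumed
        i -= 1
        node = node.get(word[i])               # None -> state is null, stop
    return lemmas

trie = build_suffix_trie(suffixdict)           # toy dicts from the question
print(lemmatize_backward("forbidding", trie, lemmaformdict))  # {'forbid'}
```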