It seems my Google-fu is failing me.
Does anyone know of a freely available dictionary that contains only the base forms of words? For something like strawberries, it would have strawberry. It should NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but a text file of mappings, or anything else that could be read in, would also be helpful.
This is called lemmatization, and what you call the "base of a word" is called a lemma. morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS-tagged input to resolve the inherent ambiguity in natural language.
(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)
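To illustrate why the POS tag matters, here is a minimal Java sketch. It is not how morpha or the Stanford tools work internally; it is a toy lookup table, and the class name `ToyLemmatizer`, the `word/POS` key format, and the handful of entries are all made up for the example. The point is only that the same surface form can map to different lemmas depending on its category.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a hypothetical hand-written table, not a real lexicon.
class ToyLemmatizer {
    // Keys are "word/POS"; the tag names NOUN and VERB are an assumption of this sketch.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("strawberries/NOUN", "strawberry");
        LEMMAS.put("saw/VERB", "see");      // "I saw him"
        LEMMAS.put("saw/NOUN", "saw");      // "a rusty saw"
        LEMMAS.put("leaves/NOUN", "leaf");
        LEMMAS.put("leaves/VERB", "leave");
    }

    // Returns the lemma for a (word, POS) pair, or the lowercased word if unknown.
    static String lemma(String word, String pos) {
        String w = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(w + "/" + pos, w);
    }

    public static void main(String[] args) {
        System.out.println(lemma("strawberries", "NOUN")); // strawberry
        System.out.println(lemma("saw", "VERB"));          // see
        System.out.println(lemma("saw", "NOUN"));          // saw
    }
}
```

Without the tag, "saw" and "leaves" cannot be resolved to a single lemma, which is why a plain word-to-base text file only gets you part of the way.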
Edit: since you're going to use this for search, here are a few tips:
(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)
This isn't exactly what you're asking for, but the Wikipedia article on stemming was enlightening and contains a number of links to free stemming programs, which presumably should include lists of word stems.
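Be aware that a stem is not necessarily a dictionary word. As a rough illustration, the sketch below implements only the plural-stripping rules from step 1a of the Porter stemmer (SSES to SS, IES to I, SS unchanged, S removed); the class and method names are my own, and a full stemmer has many more steps.

```java
// Sketch of Porter stemmer step 1a only; not a complete stemmer.
class PluralStemSketch {
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ponies -> poni
        }
        if (word.endsWith("ss")) {
            return word;                                 // caress -> caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // cats -> cat
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(step1a("strawberries")); // strawberri
        System.out.println(step1a("caresses"));     // caress
        System.out.println(step1a("cats"));         // cat
    }
}
```

So for "strawberries" this rule gives "strawberri" rather than "strawberry". That is fine for matching terms in a search index, but not if you need the actual base word.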