
all possible wordform completions of a (biomedical) word's stem

I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do it right now, I would probably just go with something like:

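library(tm)
library(RWeka)
data("crude")  # example corpus shipped with tm
# Collect the unique words across all documents in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Return every dictionary word that contains the Lovins stem of "company"
grep(pattern = LovinsStemmer("company"),
    ignore.case = TRUE, x = dictionary, value = TRUE)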

I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
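
For comparison, the same grep-style lookup could be sketched in Python. This is only an illustration under stated assumptions: `find_variants` and the toy word list are hypothetical, and the stem is typed in by hand rather than computed by a Lovins stemmer.

```python
import re

def find_variants(stem, dictionary):
    """Return the dictionary words that contain `stem`, ignoring case.

    Mirrors grep(..., ignore.case = TRUE, value = TRUE) from the R snippet.
    """
    pattern = re.compile(re.escape(stem), re.IGNORECASE)
    return [word for word in dictionary if pattern.search(word)]

# Toy word list, made up for illustration; a real run would use the corpus vocabulary.
dictionary = ["leukocyte", "Leukocytes", "leukocytic", "lymphocyte", "company"]

# The stem is supplied by hand here (an assumption), not produced by a stemmer.
variants = find_variants("leukocyt", dictionary)
print(variants)
```

Like the R version, this matches the stem anywhere in the word, so an aggressive (short) stem can also pull in unrelated words.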

I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.

asked Dec 10 '25 17:12 by Mark Miller


1 Answer

This solution requires preprocessing your corpus, but once that is done, finding the variants of a word is a very quick dictionary lookup.

from collections import defaultdict
from stemming.porter2 import stem

# Load the corpus vocabulary, one word per line
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Map each stem to the list of words that reduce to it
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Look up all words that share a stem with the query word
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])

For the /usr/share/dict/words corpus, this produces the result

['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module, which can be installed with

pip install stemming
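
If you want to try other stemmers, the index-building step can be wrapped in a function that takes the stemmer as a parameter. The sketch below is an assumption-laden illustration, not part of the stemming module: `naive_stem`, `build_stem_index`, and `lookup_variants` are hypothetical names, and `naive_stem` is a crude placeholder standing in for a real stemmer such as Porter2 or Lovins.

```python
from collections import defaultdict

def naive_stem(word):
    """Placeholder stemmer (NOT Porter2): lowercase, drop a possessive,
    then strip at most one common suffix. Swap in a real stemmer in practice."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in ("es", "ic", "s", "e"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_stem_index(words, stemmer=naive_stem):
    """Preprocessing step: group every word under its stem."""
    index = defaultdict(list)
    for word in words:
        index[stemmer(word)].append(word)
    return index

def lookup_variants(word, index, stemmer=naive_stem):
    """Lookup step: return all indexed words sharing the query word's stem."""
    return index.get(stemmer(word), [])

# Toy vocabulary, made up for illustration
vocabulary = ["leukocyte", "leukocyte's", "leukocytes", "leukocytic", "lymphocyte"]
index = build_stem_index(vocabulary)
result = lookup_variants("leukocyte", index)
print(result)
```

Passing a more aggressive stemmer as `stemmer` widens each group, which is the trade-off the question raises between Porter and Lovins.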

answered Dec 12 '25 07:12 by BioGeek


