
Generate all word forms using Lucene & Hunspell

In an application I work on, we use the Lucene Analyzer, especially its Hunspell part. The problem I face is that I need to generate all forms of a word from a set of affix rules.

E.g., given the word 'educate' and affix rules ABC, generate all forms of 'educate': educates, educated, educative, etc.

What I'd like to know is: is it possible to do this using Lucene's Hunspell implementation? We use a Hunspell dictionary (.dic) and affix file (.aff), so it has to be a Hunspell API. Lucene's Hunspell API isn't that big; I went through it and didn't find anything suitable.

The nearest I could find on SO was this, but it has no answers related to Hunspell.

Update 1: I'm no longer working on the project where I faced this problem, but if there is a way to do this using Lucene's Analyzer, I'd be glad for the community to see the answer.

Haris Osmanagić asked Dec 05 '12 14:12


2 Answers

I think what you're looking for is Hunspell's wordforms command:

Usage: wordforms [-s | -p] dictionary.aff dictionary.dic word
-s: print only suffixed forms
-p: print only prefixed forms

Example:

$ wordforms en_US.aff en_US.dic educate
educating
educated
educate
educates
educates

Read more in the documentation.
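
The sample output above lists educates twice. If duplicates are unwanted, the output can be piped through sort -u. A minimal sketch, where unique_forms is a hypothetical helper name (not part of Hunspell) and the en_US paths are assumed to exist in the current directory:

```shell
#!/usr/bin/env bash
# unique_forms: hypothetical helper. Reads word forms on stdin (one per line)
# and prints them sorted, with duplicates removed.
unique_forms() {
    LC_ALL=C sort -u   # fixed locale so the ordering is reproducible
}

# Intended use, attempted only if the wordforms tool and the dictionary
# files are actually present:
if command -v wordforms >/dev/null 2>&1 && [ -f en_US.aff ] && [ -f en_US.dic ]; then
    wordforms en_US.aff en_US.dic educate | unique_forms
fi
```

The helper only post-processes text, so it can be tried on any list of words without Hunspell installed.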

Pillowcase answered Oct 05 '22 01:10


(The original question was about generating all forms for one given word. This answer focuses on the harder problem of generating all forms for all words of a dictionary. I post this here as this is what comes up when searching for the harder problem.)

Update on unmunching

As of 2021, Hunspell provides two tools for generating word forms: unmunch and wordforms. Their respective usage is:

# print all forms for all words whose roots are given in `roots.dic`
# and make use of affix rules defined in `affixes.aff`:
unmunch   roots.dic affixes.aff
# print the forms of ONE given word (a single root with no affix rule)
# which are allowed by the reference dictionary defined by the pair of
# `roots.dic` and `affixes.aff`:
wordforms affixes.aff roots.dic word

So affixes.aff would be given by your language, and roots.dic would be either a reference dictionary for your language, or a custom dictionary with the roots of the new words you want to generate.
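
To make the two files concrete, here is a toy pair written by hand. It is only a sketch: the rules are invented for illustration and are far simpler than a real language's affix file, and make_toy_dictionary is a hypothetical helper name. The tools are called only if they are installed.

```shell
#!/usr/bin/env bash
# make_toy_dictionary: hypothetical helper that writes a tiny .aff/.dic pair
# into the given directory. The rules below are illustrative assumptions.
make_toy_dictionary() {  # usage: make_toy_dictionary target_dir
    local dir="$1"
    # Affix file: each SFX block has a header (flag, cross-product Y/N, rule
    # count) followed by rules (flag, chars to strip, chars to add, condition).
    cat > "$dir/affixes.aff" <<'EOF'
SET UTF-8
SFX S Y 1
SFX S 0 s .
SFX D Y 1
SFX D 0 d e
EOF
    # Root file: the first line is the entry count, then one root/flags per line.
    cat > "$dir/roots.dic" <<'EOF'
1
educate/SD
EOF
}

dir=$(mktemp -d)
make_toy_dictionary "$dir"

# Call the Hunspell tools only if they are available on this machine.
if command -v unmunch >/dev/null 2>&1; then
    unmunch "$dir/roots.dic" "$dir/affixes.aff"
fi
if command -v wordforms >/dev/null 2>&1; then
    wordforms "$dir/affixes.aff" "$dir/roots.dic" educate
fi
```

Note the argument order differs between the two tools: unmunch takes the .dic file first, wordforms takes the .aff file first.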

Unfortunately, Hunspell’s unmunch is deprecated¹ and does not work properly. It is inherited from MySpell, and my guess is that it does not support all features of Hunspell. Apparently it does not properly support UTF-8. When I tried using it with the reference French dictionary (Dicollecte, v7.0), it generated garbage words by applying affix rules it was not supposed to apply (such as: conjugating non-verbs).

wordforms should be more up to date, so you might try to emulate unmunch with wordforms (as the README suggests). However, wordforms takes only one unqualified root and checks it against the whole dictionary implied by roots.dic and affixes.aff. This takes a lot of time per root and, worse, you have to call wordforms once for every root in roots.dic, so the total running time is quadratic in the number of roots. For me, with the reference set of affixes for French, this is slow to the point of being unusable, even with only 10 roots! The unusable Bash code is, for illustration:

# /!\ EXTREMELY SLOW
aff='affixes.aff'
dic='roots.dic'
# skip the first line of the .dic file (the entry count), then read each root
tail -n +2 "$dic" | while IFS= read -r root ; do
    root="${root%%/*}" # strip the optional slash and the affix flags attached to it
    [ -n "$root" ] || continue # ignore empty lines
    wordforms "$aff" "$dic" "$root" # generate all forms for this root
done \
| sort -u # sort (according to the locale) and remove duplicates

Also, note that wordforms produces bare words, while unmunch was able to attach derived metadata (such as part-of-speech or gender), so with wordforms you lose information (which may or may not matter to you).

The lack of a replacement for unmunch is a known issue. Apparently the Hunspell developers will not address it in the foreseeable future (something about funding?). This has led several people to reimplement the functionality; you'll find pointers throughout the GitHub issues.

  • In 2012 someone wrote an sh/awk script by adapting the source code of wordforms; it may be severely outdated, and I haven't tried it.
  • In 2014 someone wrote another sh/awk script to process a Hindi dictionary; it worked for me, at least better than the built-in unmunch. I don't know how accurate it is though.
  • In December 2020 someone wrote a Perl module and a Perl program; looks great, but I’m not sure how to use them.
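
For intuition, the core idea behind such reimplementations can be sketched in a few lines of awk. This is a toy and not any of the scripts listed above: it handles suffix rules only, and it assumes single-character flags, no prefixes, no cross products or continuation classes, and conditions that can be used as a regular expression anchored at the end of the word. expand_suffixes is a hypothetical name.

```shell
#!/usr/bin/env bash
# expand_suffixes: toy suffix-only expansion (hypothetical helper, NOT a
# replacement for unmunch). Usage: expand_suffixes file.aff file.dic
expand_suffixes() {
    awk '
        # First file (.aff): collect suffix rules of the form
        #   SFX flag strip add condition
        # Header lines (SFX flag Y/N count) have only 4 fields and are skipped.
        NR == FNR {
            if ($1 == "SFX" && NF >= 5) {
                n++
                flag[n]  = $2
                strip[n] = ($3 == "0") ? "" : $3
                add[n]   = ($4 == "0") ? "" : $4
                sub(/\/.*/, "", add[n])   # ignore continuation flags (assumption)
                cond[n]  = $5
            }
            next
        }
        # Second file (.dic): the first line is the entry count, skip it.
        FNR == 1 { next }
        {
            split($1, parts, "/")
            root = parts[1]; flags = parts[2]
            if (root == "") next
            print root
            for (i = 1; i <= n; i++) {
                if (index(flags, flag[i]) == 0) continue   # rule not attached
                if (root !~ (cond[i] "$")) continue        # condition fails
                stem = root
                if (strip[i] != "") {
                    sl = length(strip[i])
                    if (substr(root, length(root) - sl + 1) != strip[i]) continue
                    stem = substr(root, 1, length(root) - sl)
                }
                print stem add[i]
            }
        }
    ' "$1" "$2"
}
```

For example, with an affix file containing the rule `SFX G e ing e` (after its `SFX G Y 1` header) and a root file containing `educate/G` after the count line, `expand_suffixes toy.aff toy.dic` prints educate and educating. Unlike the quadratic loop over wordforms, this makes a single pass over the roots, but real dictionaries use many features this sketch ignores.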

¹ From the repo’s README.

Maëlan answered Oct 05 '22 02:10