
Elasticsearch - breaking English compound words?

I'm looking for a filter in Elasticsearch that will let me break English compound words into their constituent parts, so that, for a term like eyewitness, the queries eye witness and eyewitness would both match eyewitness. I noticed the compound word filter, but it requires explicitly defining a word list, which I couldn't possibly come up with on my own.

Lucifer N. asked Jul 28 '14 04:07



1 Answer

First, ask yourself whether you really need to break the compound words. Consider a simpler approach, such as using "edge n-grams" to match on the leading or trailing edge of a term. This has the side effect of loosely matching fragments like "ey", but that may be acceptable for your situation.
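To make the edge n-gram idea concrete, here is a minimal Python sketch. The helper `edge_ngrams` is a hypothetical illustration of which fragments get indexed, not Elasticsearch code, and the names `leading_edges` and `edge_analyzer` in the settings are made up for this example. The filter type is written as `edge_ngram`; older releases spelled it `edgeNGram`, so check the name against your version.

```python
def edge_ngrams(term, min_gram=2, max_gram=10, from_back=False):
    """Return the edge fragments of a term, shortest first.

    Hypothetical helper that mimics what an edge n-gram filter emits:
    prefixes by default, suffixes when from_back is True.
    """
    text = term.lower()
    upper = min(max_gram, len(text))
    sizes = range(min_gram, upper + 1)
    if from_back:
        return [text[-n:] for n in sizes]
    return [text[:n] for n in sizes]


# Index settings sketch (assumed names; pass as the "settings" part of
# an index-creation request body).
EDGE_SETTINGS = {
    "analysis": {
        "filter": {
            "leading_edges": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 10,
            }
        },
        "analyzer": {
            "edge_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "leading_edges"],
            }
        },
    }
}

print(edge_ngrams("eyewitness", 2, 4))  # ['ey', 'eye', 'eyew']
```

Note that leading edges of eyewitness include "eye" but never "witness"; only the trailing edges (`from_back=True`) contain it, so matching both halves would need both directions.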

If you do need to break the compounds and want to explicitly index the word fragments, then you'll need to get a word list. You can download a list of English words; one example is here. The dictionary word list is used to determine which fragments of a compound word are actually words themselves. This will add overhead to your indexing, so be sure to test it. An example showing the usage is here.
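As a rough sketch of how a dictionary-based split behaves, the Python below approximates the idea: keep the original term and add every substring that appears in the word list. The function `decompound` is a simplified, hypothetical stand-in, not the actual filter logic. The settings use the `dictionary_decompounder` filter type with an inline `word_list`; the names `compound_splitter` and `compound_analyzer` and the two-word list are assumptions for illustration (a real list would normally be loaded from a file via `word_list_path`).

```python
def decompound(term, word_list, min_subword_size=2, max_subword_size=15):
    """Return the term plus every dictionary word found inside it.

    Simplified approximation: scans all substrings within the size
    limits and keeps those present in the word list.
    """
    words = {w.lower() for w in word_list}
    text = term.lower()
    tokens = [text]  # the original compound is always kept
    for start in range(len(text)):
        longest = min(max_subword_size, len(text) - start)
        for size in range(min_subword_size, longest + 1):
            fragment = text[start:start + size]
            if fragment in words and fragment != text:
                tokens.append(fragment)
    return tokens


# Index settings sketch (assumed names and a toy word list).
DECOMPOUND_SETTINGS = {
    "analysis": {
        "filter": {
            "compound_splitter": {
                "type": "dictionary_decompounder",
                "word_list": ["eye", "witness"],
            }
        },
        "analyzer": {
            "compound_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "compound_splitter"],
            }
        },
    }
}

print(decompound("eyewitness", ["eye", "witness"]))
# ['eyewitness', 'eye', 'witness']
```

With fragments indexed this way, a query for eye witness and a query for eyewitness can both match a document containing eyewitness. Every substring check runs per token at index time, which is where the extra indexing overhead comes from.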

If your text is German, consider https://github.com/jprante/elasticsearch-analysis-decompound

Andy answered Oct 22 '22 15:10
