I'm looking for a filter in Elasticsearch that will let me break English compound words into their constituent parts, so that, for a term like "eyewitness", the queries "eye witness" and "eyewitness" would both match "eyewitness". I noticed the compound word filter, but it requires explicitly defining a word list, which I couldn't possibly come up with on my own.
First, ask yourself whether you really need to break the compound words. Consider a simpler approach, such as using "edge n-grams" to match on the leading or trailing edges of a word. It would have the side effect of loosely matching on fragments like "ey", but maybe that would be acceptable for your situation.
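To make the edge n-gram idea concrete, here is a minimal sketch in Python. The helper edge_ngrams only imitates what a leading-edge n-gram filter emits for one term, and the names prefix_filter and prefix_analyzer in the settings are made up for illustration. The gram sizes are assumptions you would tune; raising min_gram is one way to cut down on short fragments like "ey".

```python
def edge_ngrams(term, min_gram=2, max_gram=10):
    """Return the leading-edge n-grams of a term, shortest first."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


# Index settings using the "edge_ngram" token filter type.
# "prefix_filter" and "prefix_analyzer" are hypothetical names.
EDGE_NGRAM_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "prefix_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                }
            },
            "analyzer": {
                "prefix_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefix_filter"],
                }
            },
        }
    }
}

# The query term "eye" is one of the grams indexed for "eyewitness".
grams = edge_ngrams("eyewitness")
print("eye" in grams)  # True
print("ey" in grams)   # True: the loose fragment match mentioned above
```

Note that leading-edge grams alone would let "eye" match but not "witness"; matching the trailing part needs grams taken from the other end of the word as well.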
If you do need to break the compounds, and want to explicitly index the word fragments, then you'll need to get a word list. You can download a list of English words; one example is here. The dictionary word list is used to determine which fragments of the compound words are actually words themselves. This will add overhead to your indexing, so be sure to test it. An example showing the usage is here.
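As a rough sketch of the dictionary approach, the Python below first imitates the lookup a dictionary-based decompounder performs, then shows what the index settings could look like. The decompound helper is a simplification written for illustration, not the actual filter implementation, and the names english_decompounder and decompound_analyzer are hypothetical. The two-word list is a stand-in; for a downloaded dictionary you would more likely point the filter at a file with word_list_path instead of inlining word_list.

```python
def decompound(token, word_list, min_subword_size=2,
               max_subword_size=15, min_word_size=5):
    """Return the token plus every substring found in the word list."""
    words = {w.lower() for w in word_list}
    parts = [token]  # the original token is always kept
    if len(token) < min_word_size:
        return parts
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            piece = lowered[start:start + size]
            if piece in words:
                parts.append(piece)
    return parts


# Index settings using the "dictionary_decompounder" token filter type.
# Filter and analyzer names are made up for this example.
DECOMPOUND_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "english_decompounder": {
                    "type": "dictionary_decompounder",
                    "word_list": ["eye", "witness"],
                }
            },
            "analyzer": {
                "decompound_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_decompounder"],
                }
            },
        }
    }
}

indexed = decompound("eyewitness", ["eye", "witness"])
print(indexed)  # ['eyewitness', 'eye', 'witness']

# Both query forms from the question share tokens with the indexed term.
print(all(t in indexed for t in "eye witness".split()))  # True
print("eyewitness" in indexed)                            # True
```

The nested loop over every start position and subword size is where the indexing overhead comes from, which is why it is worth measuring with a realistic word list.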
If your text is German, consider https://github.com/jprante/elasticsearch-analysis-decompound