
ElasticSearch incorrectly indexing and querying on non-alphanumeric characters

My ElasticSearch index is not correctly indexing and querying non-alphanumeric characters. Specifically, dots and dashes are causing problems.

If I index a document with the name "O.K. Corral," it should match queries for "OK Corral". Similarly, if I index "Whiskey A Go-Go," I'd like it to match "Whiskey A GoGo" and "Whiskey A Go Go".

Right now, only queries with the correct dots and dashes will return these documents.

I'm hoping the solution will also solve any potential problems with other non-alphanumeric characters, like commas and apostrophes.

It sounds like a job for ElasticSearch token filters, but I haven't been able to find one that does what I'm looking for. Also, I would like to do this within ElasticSearch -- I don't want to write custom string manipulations to normalize data before it gets to my ES index.

Thanks for your help!

Clay Wardell asked Aug 28 '12 23:08



1 Answer

You might want to have a look at the Word Delimiter Token Filter. It will at least do what you want for "Whiskey A GoGo" and "Whiskey A Go-Go". You can check its behaviour in advance using the Analyze API.
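As a rough sketch of how that could be wired up, the Python below only builds the JSON bodies; it does not talk to a cluster. The names `name_delimiter` and `name_analyzer` are made up for illustration, and the choice of the `whitespace` tokenizer is an assumption (the standard tokenizer would already split on the dash before the filter sees it). The `catenate_words` option is what should add a joined token such as "GoGo" next to the separate parts, but verify the actual tokens with the Analyze API on your version.

```python
import json


def build_index_settings():
    """Index settings with a custom analyzer that uses word_delimiter.

    "name_delimiter" and "name_analyzer" are hypothetical names.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "name_delimiter": {
                        "type": "word_delimiter",
                        # Keep the separate parts, e.g. "Go" and "Go".
                        "generate_word_parts": True,
                        # Also emit the parts joined together, e.g. "GoGo".
                        "catenate_words": True,
                    }
                },
                "analyzer": {
                    "name_analyzer": {
                        "type": "custom",
                        # Assumption: split on whitespace only, so dots and
                        # dashes reach the word_delimiter filter intact.
                        "tokenizer": "whitespace",
                        "filter": ["name_delimiter", "lowercase"],
                    }
                },
            }
        }
    }


def build_analyze_request(text):
    """Body for an Analyze API call that runs `text` through the analyzer."""
    return {"analyzer": "name_analyzer", "text": text}


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
    print(json.dumps(build_analyze_request("Whiskey A Go-Go")))
```

The first body would be sent when creating the index, and the second to that index's `_analyze` endpoint. Older releases took the analyzer and text as query parameters instead of a JSON body, so check the documentation for the version you run.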

javanna answered Oct 07 '22 05:10
