ElasticSearch incorrectly indexing and querying on non-alphanumeric characters

Question

My ElasticSearch index is not correctly indexing and querying non-alphanumeric characters. Specifically, dots and dashes are causing problems.

If I index a document with the name "O.K. Corral", it should match queries for "OK Corral". Similarly, if I index "Whiskey A Go-Go", I'd like it to match "Whiskey A GoGo" and "Whiskey A Go Go".

Right now, only queries with the correct dots and dashes will return these documents.

I'm hoping the solution will also solve any potential problems with other non-alphanumeric characters, like commas and apostrophes.

It sounds like a job for ElasticSearch token filters, but I haven't been able to find one that does what I'm looking for. Also, I would like to do this within ElasticSearch -- I don't want to write custom string manipulations to normalize data before it gets to my ES index.

Thanks for your help!

javanna · Accepted Answer

You might want to have a look at the Word Delimiter Token Filter. It will at least do what you want with "Whiskey A GoGo" and "Whiskey A Go-Go". You can check its behaviour in advance using the Analyze API.
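As a rough sketch of how that could look, the Python snippet below only builds the two JSON request bodies involved: index settings that define a custom analyzer using the `word_delimiter` filter, and a body for the Analyze API. The filter name `name_delimiter`, the analyzer name `name_analyzer`, and the choice of the `whitespace` tokenizer plus `lowercase` filter are my assumptions, not something from the question. `catenate_words` is switched on so that "Go-Go" should also produce a joined "GoGo" token. Newer Elasticsearch versions recommend `word_delimiter_graph` instead, so check the documentation for your version.

```python
import json


def build_index_settings():
    """Build index settings with a word_delimiter-based analyzer.

    The names "name_delimiter" and "name_analyzer" are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "name_delimiter": {
                        "type": "word_delimiter",
                        # Emit the parts on either side of a delimiter.
                        "generate_word_parts": True,
                        # Also emit the parts joined together.
                        "catenate_words": True,
                    }
                },
                "analyzer": {
                    "name_analyzer": {
                        "type": "custom",
                        # Assumption: whitespace tokenizer, so dots and
                        # dashes are still present when the filter runs.
                        "tokenizer": "whitespace",
                        "filter": ["name_delimiter", "lowercase"],
                    }
                },
            }
        }
    }


def build_analyze_request(text):
    """Build a body for the Analyze API using the custom analyzer."""
    return {"analyzer": "name_analyzer", "text": text}


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
    print(json.dumps(build_analyze_request("Whiskey A Go-Go"), indent=2))
```

The first body would go in the index creation request and the second to the `_analyze` endpoint of that index. Running the analyzer over "O.K. Corral" and "Whiskey A Go-Go" there is the quickest way to confirm which tokens are actually produced before reindexing anything.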

Tags:

string

elasticsearch

normalization

Clay Wardell

1 Answer

javanna
