 

Elasticsearch: Levenshtein sorting

I have a query that works well enough, but I want to sort its results by the Levenshtein distance between the query parameter and the field in question.

Right now I run the query in ES and then do the sorting in my application. I'm currently testing a script-based sort instead. This is the script:

import  org.elasticsearch.common.logging.*;
ESLogger logger = ESLoggerFactory.getLogger('levenshtein_script');

def str1 = '%s'.split(' ').sort().join(' ');
def str2 = doc['%s'].values.join(' '); //Needed since the field is analyzed. This will change when I reindex the data.
def dist = new int[str1.size() + 1][str2.size() + 1]
(0..str1.size()).each { dist[it][0] = it }
(0..str2.size()).each { dist[0][it] = it }
(1..str1.size()).each { i ->
   (1..str2.size()).each { j ->
       dist[i][j] = [dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + ((str1[i - 1] == str2[j - 1]) ? 0 : 1)].min()
   }
}
def result = dist[str1.size()][str2.size()]
logger.info('Query param: ['+str1+'] | Term: ['+str2+'] | Result: ['+result+']');
return result;

Basically this is a template (note the %s placeholders) that I fill in from my application like this:

sortScript = String.format(EDIT_DISTANCE_GROOVY_FUNC, fullname, FULLNAME_FIELD_NAME);

The problem is the cost of scripts like this, as described in http://code972.com/blog/2015/03/84-elasticsearch-one-tip-a-day-avoid-costly-scripts-at-all-costs. That concern is understandable.

My question is: how can I do what I need (sort the results by Levenshtein distance) inside Elasticsearch, so that I can avoid the overhead in my application? Can I use Lucene expressions for this? Do you have an example? Is there some other way to accomplish this?

I'm using Elasticsearch 1.7.5 as a service, so native plugins should not be the first solution. (I don't even know if they are possible; I'll have to check with my provider. If it's the only viable solution, I will do just that.)

UPDATE

So it seems a good solution would be to save the script in the config/scripts folder, since it will then be compiled only once: https://www.elastic.co/blog/running-groovy-scripts-without-dynamic-scripting. Alternatively, the script can be indexed instead of saved to disk: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html. Indexing is much more convenient for my use case. Does an indexed script behave the same way with regard to compilation? Will it be compiled only once?

Alkis Kalogeris asked Sep 12 '16 03:09



1 Answer

It's important to note that Groovy is deprecated in Elasticsearch 5.x and will be removed in Elasticsearch 6.0. You will want to either replace this functionality with Painless scripting or create a native Java script, possibly one that uses Lucene's LuceneLevenshteinDistance to do the work for you.

Your script is also pretty scary in that it adds a number of loops (mostly hidden by Groovy helpers) and potentially large memory allocations into the mix. I have serious doubts about its performance at scale.
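To make the memory concern concrete, here is a minimal Java sketch of the same dynamic program that keeps only two rows instead of the full (m+1) x (n+1) matrix. The class and method names (EditDistance, levenshtein) are made up for illustration and are not part of Elasticsearch or Lucene. It still does O(m*n) work per document, so it does not fix the cost of running the script at scale.

```java
final class EditDistance {
    private EditDistance() {}

    /** Levenshtein distance using two rolling rows rather than a full matrix. */
    static int levenshtein(String a, String b) {
        // Put the shorter string on the row axis so both arrays stay small.
        if (a.length() < b.length()) {
            String tmp = a;
            a = b;
            b = tmp;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(prev[j] + 1, curr[j - 1] + 1),
                        prev[j - 1] + substitution);
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

The same function could also be used as the sort key for the application-side sorting described in the question.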

I also noticed the presence of %s in the script, which I assume means that your own code replaces the field name dynamically. You should always use params for this purpose, then use the parameter as a variable in the script. This avoids having to compile a version of a script per field name. (I expect you had to do this to make it file-based)
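The effect of params on compilation can be sketched with a toy model. The Java below is not Elasticsearch code: ScriptCacheSketch is a hypothetical stand-in that assumes compiled scripts are cached by their source text, and the parameter names query and field are made up for the example. Under that assumption, formatting values into the source yields a new script (and a new compilation) for every distinct value, while a constant source with params is compiled once.

```java
import java.util.HashMap;
import java.util.Map;

final class ScriptCacheSketch {
    // Toy compile cache keyed by script source (an assumption for illustration).
    private final Map<String, Object> compiled = new HashMap<>();
    private int compilations = 0;

    void run(String source, Map<String, Object> params) {
        // "Compile" only when this exact source text has not been seen before.
        compiled.computeIfAbsent(source, s -> {
            compilations++;
            return new Object();
        });
        // params would be bound at execution time; they are not part of the cache key.
    }

    int compilations() {
        return compilations;
    }

    /** Values are formatted into the script text, as in the question. */
    static int withStringFormat(String[] names) {
        ScriptCacheSketch cache = new ScriptCacheSketch();
        for (String name : names) {
            String source = String.format(
                    "def str1 = '%s'; def str2 = doc['%s'].values.join(' ');",
                    name, "fullname");
            cache.run(source, new HashMap<String, Object>());
        }
        return cache.compilations();
    }

    /** The script text stays constant; values travel as params. */
    static int withParams(String[] names) {
        ScriptCacheSketch cache = new ScriptCacheSketch();
        String source = "def str1 = query; def str2 = doc[field].values.join(' ');";
        for (String name : names) {
            Map<String, Object> params = new HashMap<>();
            params.put("query", name);
            params.put("field", "fullname");
            cache.run(source, params);
        }
        return cache.compilations();
    }
}
```

Note that in the question's template the query string is interpolated as well as the field name, so in this model every distinct search term would count as a separate script.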

Does this have the same behaviour regarding the compilation of the script?

Yes. File-based scripts are compiled, just like inline and index-based scripts. They are also the most secure option, because installing them requires access to the machine itself.

The downside to file-based scripts is that you need to add them to every node. Not only that, but every node needs the same version of the script. This means that, if you ever choose to update it, it's better to add a new script and reference it than to replace the existing one.

File-based scripts are picked up every 60 seconds by default.

Will it be compiled only once?

Yes, per node.

pickypg answered Nov 11 '22 05:11
