I want to build an application where a match requires each token in the document to be contained in the query at least once.
Please note that this is the reverse of the usual expectation: documents are fairly small, while queries can be very long. Example:
Document:
"elastic super cool".
A valid query match would be
"I like elastic things since elasticsearch is super cool"
I managed to get the number of matched tokens from Elasticsearch (see also https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/ttJTE52hXf8). In the example above, 3 matches (= the length of the document) would mean the query matches.
But how can I combine this with synonyms?
Suppose the synonyms for "cool" are "nice", "great" and "good". By using a synonym token filter, I managed to add the synonyms at each position in the document.
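The matching rule described above can be sketched in plain Java, outside Elasticsearch. This is only an illustration: the class name ReverseMatchSketch and its lowercase, whitespace-only tokenizer are assumptions made for the example and do not reflect what a real Elasticsearch analyzer does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ReverseMatchSketch {
    // Hypothetical tokenizer: lowercase, then split on whitespace only.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Number of document tokens that also occur somewhere in the query.
    static int matchedTokens(String document, String query) {
        Set<String> queryTokens = new HashSet<>(tokenize(query));
        int matched = 0;
        for (String token : tokenize(document)) {
            if (queryTokens.contains(token)) {
                matched++;
            }
        }
        return matched;
    }

    // The query matches when every document token was found in it.
    static boolean matches(String document, String query) {
        return matchedTokens(document, query) == tokenize(document).size();
    }

    public static void main(String[] args) {
        String query = "I like elastic things since elasticsearch is super cool";
        System.out.println(matches("elastic super cool", query)); // true
        System.out.println(matches("elastic super fast", query)); // false
    }
}
```

Note that the roles are reversed compared to a normal search: the loop runs over the (short) document and only looks things up in the (long) query.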
Hence, the following four documents each have 3 token matches for the query above:
"elastic super nice"
"elastic nice cool"
"nice good great"
"good great cool"
But only the first match is a valid match!
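To make the overcounting concrete, here is a small plain-Java model of the situation. It assumes the match count behaves like a sum of term frequencies over the synonym-expanded field; that is my reading of the numbers above, not something confirmed by Elasticsearch internals. The class SynonymOvercountSketch and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SynonymOvercountSketch {
    // Synonym group from the example; each member expands to the whole group.
    static final Set<String> SYNONYMS =
            new HashSet<>(Arrays.asList("cool", "nice", "great", "good"));

    // Model of the indexed field: one set of terms per token position.
    static List<Set<String>> index(String document) {
        List<Set<String>> positions = new ArrayList<>();
        for (String token : document.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            Set<String> termsAtPosition = new HashSet<>();
            termsAtPosition.add(token);
            if (SYNONYMS.contains(token)) {
                termsAtPosition.addAll(SYNONYMS);
            }
            positions.add(termsAtPosition);
        }
        return positions;
    }

    // Assumed naive count: for every distinct query term, add the number of
    // positions at which that term occurs (its term frequency).
    static int naiveMatchCount(String document, String query) {
        Set<String> queryTerms = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        int count = 0;
        for (Set<String> termsAtPosition : index(document)) {
            for (String term : queryTerms) {
                if (termsAtPosition.contains(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String query = "I like elastic things since elasticsearch is super cool";
        String[] docs = {"elastic super nice", "elastic nice cool",
                         "nice good great", "good great cool"};
        for (String doc : docs) {
            // Each of the four documents reaches a count of 3 in this model.
            System.out.println(doc + " -> " + naiveMatchCount(doc, query));
        }
    }
}
```

In this model the single query term "cool" is counted once for every position that carries it as a synonym, which is why "nice good great" reaches 3 without sharing any other token with the query.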
How can I prevent each synonym match from counting as a separate match when the synonyms all represent the same token in the document?
Any ideas how to tackle this problem?
I read that percolators might address this issue, but I am still not sure whether percolators would work with synonyms the way I want.
Ideas?
I assume you expand the synonyms. You can use scripting to count the matching positions.
See the Elasticsearch Google Group thread with a solution by Vineeth Mohan.
I adapted his script as a native script that returns a number between 0 and 1 for the ratio of matched positions in the field. I tweaked it a bit so that each query term matches only one position.
You need a field that contains the number of positions, for example by using token_count, which actually counts the number of positions.
@Override
public Object run()
{
    // field, positionsField and terms are assumed to be populated from the script params
    IndexField indexField = this.indexLookup().get(field);
    // total number of positions in the field, read from the token_count sub-field
    Long numberOfPositions = ((ScriptDocValues.Longs) doc().get(positionsField)).getValue();
    // positions that have already been claimed by a query term
    ArrayList<Integer> positions = new ArrayList<Integer>();
    for (String term : terms)
    {
        Iterator<TermPosition> termPos = indexField.get(term, IndexLookup.FLAG_POSITIONS | IndexLookup.FLAG_CACHE)
                .iterator();
        while (termPos.hasNext())
        {
            int position = termPos.next().position;
            if (positions.contains(position))
            {
                continue;
            }
            positions.add(position);
            // if the term matches multiple positions, only a new position should count
            break;
        }
    }
    // ratio of claimed positions to all positions, between 0 and 1
    return positions.size() * 1.0 / numberOfPositions;
}
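The counting logic of the script can be tried out without an Elasticsearch cluster. The following is a minimal standalone sketch, not the native script itself: the class MatchedPositionsRatioSketch, the in-memory postings map and the whitespace tokenizer are all assumptions that stand in for the index lookup and the analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class MatchedPositionsRatioSketch {
    // Synonym group from the question; each member expands to the whole group.
    static final Set<String> SYNONYMS =
            new HashSet<>(Arrays.asList("cool", "nice", "great", "good"));

    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // Stand-in for the index lookup: term -> positions in ascending order.
    static Map<String, List<Integer>> postings(String[] tokens) {
        Map<String, List<Integer>> result = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            Set<String> terms = new HashSet<>(Collections.singleton(tokens[pos]));
            if (SYNONYMS.contains(tokens[pos])) {
                terms.addAll(SYNONYMS);
            }
            for (String term : terms) {
                result.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
            }
        }
        return result;
    }

    // Each query term may claim at most one position that is not claimed yet.
    static double ratio(String document, String query) {
        String[] docTokens = tokenize(document);
        Map<String, List<Integer>> postings = postings(docTokens);
        Set<Integer> claimed = new HashSet<>();
        for (String term : tokenize(query)) {
            for (int position : postings.getOrDefault(term, Collections.emptyList())) {
                if (claimed.add(position)) {
                    break; // this term has used up its single position
                }
            }
        }
        return claimed.size() * 1.0 / docTokens.length;
    }

    public static void main(String[] args) {
        String query = "I like elastic things since elasticsearch is super cool";
        System.out.println(ratio("elastic super nice", query)); // 1.0
        System.out.println(ratio("elastic nice cool", query));  // 2 of 3 positions
        System.out.println(ratio("nice good great", query));    // 1 of 3 positions
    }
}
```

With the example query, only "elastic super nice" reaches a ratio of 1.0 here; in the other documents "cool" can claim just one of the synonym positions, so the ratio stays below 1.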
You can then use it in your query as a function_score script.
{
  "function_score": {
    "query": {
      "match": {
        "message": "I like elastic things since elasticsearch is super cool"
      }
    },
    "script_score": {
      "params": {
        "terms": [
          "I",
          "like",
          "elastic",
          "things",
          "since",
          "elasticsearch",
          "is",
          "super",
          "cool"
        ],
        "field": "message",
        "positions_field": "message.pos_count"
      },
      "lang": "native",
      "script": "matched_positions_ratio"
    },
    "boost_mode": "replace"
  }
}
You may then set "min_score" to 1 and only get documents that match all positions in the given field.
I hope this solution is what you need.