 

Penalising - but not eliminating - duplicates in ElasticSearch

I have some data with duplicate fields. I don't want duplicates to appear together at the top of search results, but I don't want to eliminate them altogether either. I just want better variety, so the 2nd, 3rd, ... nth occurrence of the same field value is demoted. Is that possible with ElasticSearch?

For example:

curl -XPOST 'http://localhost:9200/employeeid/info/1' -d '{
  "name": "John",
  "organisation": "Apple",
  "importance": 1000
}'

curl -XPOST 'http://localhost:9200/employeeid/info/2' -d '{
  "name": "John",
  "organisation": "Apple",
  "importance": 2000
}'

curl -XPOST 'http://localhost:9200/employeeid/info/3' -d '{
  "name": "Sam",
  "organisation": "Apple",
  "importance": 0
}'

(based on this)

If we assume search is boosted by importance, the natural result for an "Apple" search would be John, John, Sam. What I am looking for is a way to make the result John, Sam, John, i.e. penalise the second John because another John has already appeared.
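
For reference, the kind of importance-boosted search I have in mind looks something like this (just a sketch; the field_value_factor settings here are only illustrative):

curl -XPOST 'http://localhost:9200/employeeid/info/_search' -d '{
  "query": {
    "function_score": {
      "query": { "match": { "organisation": "Apple" } },
      "field_value_factor": { "field": "importance", "missing": 0 },
      "boost_mode": "sum"
    }
  }
}'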

mahemoff asked Feb 16 '17 16:02




1 Answer

You could adjust the importance field at index time by finding all the documents that share a name, choosing one of them to be 'more important' - say the one with the highest existing importance - and adding a fixed boost to it. From your example, I would add 5000 to the existing value of importance (Sam, having no duplicate, is his own 'chosen' document and gets the same 5000 boost).

The results would now rank as follows.

John/Apple-7000, Sam/Apple-5000, John/Apple-1000
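
A rough sketch of how that index-time adjustment could be done (assuming name has a keyword sub-field such as name.keyword, and that document 2 is the John chosen for the boost):

# Find the highest-importance document for each name
curl -XPOST 'http://localhost:9200/employeeid/info/_search' -d '{
  "size": 0,
  "aggs": {
    "by_name": {
      "terms": { "field": "name.keyword" },
      "aggs": {
        "top_doc": { "top_hits": { "size": 1, "sort": [ { "importance": "desc" } ] } }
      }
    }
  }
}'

# Bump the chosen document (John with importance 2000 -> 7000)
curl -XPOST 'http://localhost:9200/employeeid/info/2/_update' -d '{
  "doc": { "importance": 7000 }
}'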

But this means you would need to re-index if you later decided to change the 5000 to, say, 10000 to adjust the scoring, since the ranking depends on the absolute magnitude of importance.

Alternatively, you could add another field called 'authority', set to 1 for the duplicate with the highest importance (and 0 for the others), and use a scoring function to provide a step at query time:

"script_score": {
   "script": "(_score * 5000) + doc['importance'].value + (doc['authority'].value * 5000)"
}

Note that the multiplier for _score depends on the original ranking algorithm; this assumes _score values between 0.0 and 1.0.
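
Wrapped into a full query, that might look something like this (again a sketch - boost_mode and the 5000 step are illustrative):

{
  "query": {
    "function_score": {
      "query": { "match": { "organisation": "Apple" } },
      "script_score": {
        "script": "(_score * 5000) + doc['importance'].value + (doc['authority'].value * 5000)"
      },
      "boost_mode": "replace"
    }
  }
}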

abdollar answered Nov 16 '22 03:11