I have some data with duplicate fields. I don't want duplicates to appear together at the top of search results, but I don't want to eliminate them altogether either. I just want better variety, so that the 2nd, 3rd, ..., nth occurrence of the same field value is demoted. Is that possible with ElasticSearch?
For example:
curl -XPOST 'http://localhost:9200/employeeid/info/1' -d '{
"name": "John",
"organisation": "Apple",
"importance": 1000
}'
curl -XPOST 'http://localhost:9200/employeeid/info/2' -d '{
"name":"John",
"organisation":"Apple",
"importance": 2000
}'
curl -XPOST 'http://localhost:9200/employeeid/info/3' -d '{
"name": "Sam",
"organisation": "Apple",
"importance": 0
}'
(based on this)
If we assume search is boosted by importance, the natural result for an "Apple" search would be John, John, Sam. What I am looking for is a way to make the result John, Sam, John, i.e. penalise the second John because another John has already appeared.
You could adjust the importance field at index time: find all duplicates and choose one document from each group to be 'more important', for example the duplicate with the highest importance. From your example, I would add 5000 to the chosen document's existing importance.
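For the example data, a minimal way to do this is simply to re-index the chosen documents with the boosted values. Note that each name's most important document gets the step, so Sam's single document is boosted as well:
curl -XPOST 'http://localhost:9200/employeeid/info/2' -d '{
"name": "John",
"organisation": "Apple",
"importance": 7000
}'
curl -XPOST 'http://localhost:9200/employeeid/info/3' -d '{
"name": "Sam",
"organisation": "Apple",
"importance": 5000
}'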
The results would now rank as follows:
John/Apple-7000, Sam/Apple-5000, John/Apple-1000
However, this means you would need to re-index if you later decided to change the 5000 to 10000 to adjust the scoring, since the ranking depends on the magnitude of importance.
Alternatively, you could add another field called 'authority', set to 1 on the duplicate with the highest importance (and 0 on the others), and use a scoring function to provide a step at query time:
"script_score": {
"script": "(_score * 5000) + doc['importance'].value + (doc['authority'].value * 5000)"
}
Note that the multiplier for _score depends on the original ranking algorithm; this assumes _score values between 0.0 and 1.0.
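Put together, a sketch of the full search request body might look like this (assuming the Elasticsearch 1.x function_score query, where script_score is one of its functions; "boost_mode": "replace" makes the script output the final score):
{
  "query": {
    "function_score": {
      "query": { "match": { "organisation": "Apple" } },
      "script_score": {
        "script": "(_score * 5000) + doc['importance'].value + (doc['authority'].value * 5000)"
      },
      "boost_mode": "replace"
    }
  }
}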