I have some data with duplicate fields. I don't want duplicates to appear together at the top of search results, but I don't want to eliminate them altogether either. I just want better variety, so that the 2nd, 3rd, ..., nth occurrence of the same field value is demoted. Is that possible with ElasticSearch?
For example:
curl -XPOST 'http://localhost:9200/employeeid/info/1' -d '{
"name": "John",
"organisation": "Apple",
"importance": 1000
}'
curl -XPOST 'http://localhost:9200/employeeid/info/2' -d '{
"name":"John",
"organisation":"Apple",
"importance": 2000
}'
curl -XPOST 'http://localhost:9200/employeeid/info/3' -d '{
"name": "Sam",
"organisation": "Apple",
"importance": 0
}'
(based on this)
If we assume search is boosted by importance, the natural result for an "Apple" search would be John, John, Sam. What I am looking for is a way to make the result John, Sam, John, i.e. penalise the second John because another John has already appeared.
You could adjust the importance field at index time: find all duplicates and choose one document from each group to be 'more important', for example the duplicate with the highest importance. From your example, I would add 5000 to the chosen document's existing importance.
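For the example data, a minimal way to do this is simply to re-index the chosen documents with the boosted values. Note that each name's most important document gets the step, so Sam's single document is boosted as well:
curl -XPOST 'http://localhost:9200/employeeid/info/2' -d '{
"name": "John",
"organisation": "Apple",
"importance": 7000
}'
curl -XPOST 'http://localhost:9200/employeeid/info/3' -d '{
"name": "Sam",
"organisation": "Apple",
"importance": 5000
}'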
The results would now rank as follows:
John/Apple-7000, Sam/Apple-5000, John/Apple-1000
However, this means you would need to re-index if you later decided to change the 5000 to 10000 to adjust the scoring, since the ranking depends on the magnitude of importance.
Alternatively, you could add another field called 'authority', set to 1 on the duplicate with the highest importance (and 0 on the others), and use a scoring function to provide a step at query time:
"script_score": {
"script": "(_score * 5000) + doc['importance'].value + (doc['authority'].value * 5000)"
}
Note that the multiplier for _score depends on the original ranking algorithm; this assumes _score values between 0.0 and 1.0.
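Put together, a sketch of the full search request body might look like this (assuming the Elasticsearch 1.x function_score query, where script_score is one of its functions; "boost_mode": "replace" makes the script output the final score):
{
  "query": {
    "function_score": {
      "query": { "match": { "organisation": "Apple" } },
      "script_score": {
        "script": "(_score * 5000) + doc['importance'].value + (doc['authority'].value * 5000)"
      },
      "boost_mode": "replace"
    }
  }
}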