I need help with index design for a real scenario. It is a long question, so I'll try to explain it as concisely as possible.
We are building a search platform on Elasticsearch to provide site search for our customers. The documents in the index look something like this:
{ "Path":"http://www.foo.com/doc/abc/1", "Title":"Title 1", "Description":"The description of doc 1", ... }
{ "Path":"http://www.foo.com/doc/abc/2", "Title":"Title 2", "Description":"The description of doc 2", ... }
{ "Path":"http://www.foo.com/doc/abc/3", "Title":"Title 3", "Description":"The description of doc 3", ... }
...
For each query, the hits are sorted by relevance by default, but our customer also wants to boost specific documents for specific keywords.
They give us a boosting configuration XML like the following:
<boost>
  <Keywords value="keyword1">
    <Path rank="10000">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
  <Keywords value="keyword2">
    <Path rank="10000">http://www.foo.com/doc/abc/2</Path>
    <Path rank="9900">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
  <Keywords value="keyword3">
    <Path rank="10000">http://www.foo.com/doc/abc/3</Path>
    <Path rank="9900">http://www.foo.com/doc/abc/2</Path>
    <Path rank="9800">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
</boost>
That means that if a user searches for "keyword1", the top hit should be the document whose Path field value is "http://www.foo.com/doc/abc/1", regardless of that document's relevance score. Similarly, for a search on "keyword3", the top 3 hits should be the documents whose Path values are "http://www.foo.com/doc/abc/3", "http://www.foo.com/doc/abc/2" and "http://www.foo.com/doc/abc/1", in that order.
To satisfy this requirement, my design is to first invert the original boosting XML into the following format:
<boost>
  <Path value="http://www.foo.com/doc/abc/1">
    <keywords>
      <keyword value="keyword1" rank="10000" />
      <keyword value="keyword2" rank="9900" />
      <keyword value="keyword3" rank="9800" />
    </keywords>
  </Path>
  <Path value="http://www.foo.com/doc/abc/2">
    <keywords>
      <keyword value="keyword2" rank="10000" />
      <keyword value="keyword3" rank="9900" />
    </keywords>
  </Path>
  <Path value="http://www.foo.com/doc/abc/3">
    <keywords>
      <keyword value="keyword3" rank="10000" />
    </keywords>
  </Path>
</boost>
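The inversion step could be sketched roughly as follows. This is only an illustration, assuming the customer's XML has exactly the shape shown earlier; the function name `invert_boost_config` and the shortened sample in `BOOST_XML` are made up for the example, and the result is kept as a Python dict rather than written back out as XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened sample of the customer's keyword -> paths configuration.
BOOST_XML = """
<boost>
  <Keywords value="keyword1">
    <Path rank="10000">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
  <Keywords value="keyword2">
    <Path rank="10000">http://www.foo.com/doc/abc/2</Path>
    <Path rank="9900">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
</boost>
"""


def invert_boost_config(xml_text):
    """Invert keyword -> (path, rank) entries into path -> [{keyword, rank}]."""
    by_path = defaultdict(list)
    root = ET.fromstring(xml_text.strip())
    for keywords in root.findall("Keywords"):
        keyword = keywords.get("value")
        for path in keywords.findall("Path"):
            by_path[path.text.strip()].append(
                {"keyword": keyword, "rank": int(path.get("rank"))}
            )
    return dict(by_path)


inverted = invert_boost_config(BOOST_XML)
```

Each value in `inverted` already has the keyword/rank shape that a per-document "Boost" array would need.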
Then add a nested field "Boost", which contains an array of keyword/rank objects, to each Elasticsearch document, as in the following example:
{
  "Boost": [
    { "keyword":"keyword1", "rank": 10000},
    { "keyword":"keyword2", "rank": 9900},
    { "keyword":"keyword3", "rank": 9800}
  ],
  "Path":"http://www.foo.com/doc/abc/1",
  "Title":"Title 1",
  "Description":"The description of doc 1",
  ...
}
{
  "Boost": [
    { "keyword":"keyword2", "rank": 10000},
    { "keyword":"keyword3", "rank": 9900}
  ],
  "Path":"http://www.foo.com/doc/abc/2",
  "Title":"Title 2",
  "Description":"The description of doc 2",
  ...
}
{
  "Boost": [
    { "keyword":"keyword3", "rank": 10000}
  ],
  "Path":"http://www.foo.com/doc/abc/3",
  "Title":"Title 3",
  "Description":"The description of doc 3",
  ...
}
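As a rough sketch of how such documents might be prepared before indexing: the mapping below assumes a reasonably recent Elasticsearch version where the `keyword` field type is available (older versions would need a non-analyzed string instead), and `BOOST_MAPPING` and `attach_boost` are hypothetical names for this example, not part of any library.

```python
# Hypothetical mapping fragment: "nested" keeps each keyword/rank pair
# together so a query cannot match the keyword of one entry against the
# rank of another.
BOOST_MAPPING = {
    "properties": {
        "Boost": {
            "type": "nested",
            "properties": {
                "keyword": {"type": "keyword"},  # exact-match field (assumption)
                "rank": {"type": "integer"},
            },
        }
    }
}


def attach_boost(doc, boosts_by_path):
    """Return a copy of doc with a "Boost" array looked up by its Path."""
    enriched = dict(doc)
    enriched["Boost"] = list(boosts_by_path.get(doc["Path"], []))
    return enriched


boosts_by_path = {
    "http://www.foo.com/doc/abc/3": [{"keyword": "keyword3", "rank": 10000}],
}
doc3 = attach_boost(
    {"Path": "http://www.foo.com/doc/abc/3", "Title": "Title 3"},
    boosts_by_path,
)
unboosted = attach_boost(
    {"Path": "http://www.foo.com/doc/abc/9", "Title": "Title 9"},
    boosts_by_path,
)
```

Documents with no entry in the boosting configuration simply get an empty "Boost" array.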
Then, at query time, use a nested query to get the rank value of each matched document for the given search keyword, and use a score script to adjust the relevance score by this rank value.
Since the rank values from the boosting XML are much larger than a normal relevance score (generally less than 5), the adjusted scores of the documents configured in the boosting XML for the given keyword should be the top scores.
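One possible shape for such a query body, as a sketch only: instead of a score script it uses a `function_score` query with `field_value_factor` inside the nested query, so the nested clause scores exactly the stored rank. The searched fields (`Title`, `Description`), the exact-match `term` on `Boost.keyword`, and the function name `build_boosted_query` are assumptions for illustration.

```python
def build_boosted_query(user_query):
    """Build a search body that adds the configured rank to the text score."""
    return {
        "query": {
            "bool": {
                # Scores of all matching "should" clauses are added together.
                "should": [
                    # Ordinary relevance over the content fields (assumed names).
                    {
                        "multi_match": {
                            "query": user_query,
                            "fields": ["Title", "Description"],
                        }
                    },
                    # Contributes the stored rank when the document has a
                    # Boost entry whose keyword equals the search text.
                    {
                        "nested": {
                            "path": "Boost",
                            "score_mode": "max",
                            "query": {
                                "function_score": {
                                    "query": {
                                        "term": {"Boost.keyword": user_query}
                                    },
                                    "field_value_factor": {"field": "Boost.rank"},
                                    "boost_mode": "replace",
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


body = build_boosted_query("keyword3")
```

Because both clauses are `should` clauses, a configured document would still be returned when its text does not match the keyword at all; moving the `multi_match` into `must` would change that.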
Do you think this is a good design for Elasticsearch? Any suggestions for better approaches?
Thanks in advance!
It may be better to index the keywords in a separate field alongside the original documents and then, during search, just boost matches in that field.
This is not exactly what you described, as it doesn't give you fine control over the boost factor for each keyword. But it is definitely a way to make specific documents appear higher in the search results if the query contains specific keywords.
If you really need better control over the boost factor for different keywords, you can still get it with this method, but you'll need to create several "boosted keywords" fields and boost them differently in the query.
For example:
{ "Path":"http://www.foo.com/doc/abc/1",
"Title":"Title 1",
"Description":"The description of doc 1",
"boost_kw1": "keyword1 keyword2",
"boost_kw2": "keyword3 keyword4" },
{ "Path":"http://www.foo.com/doc/abc/2",
"Title":"Title 2",
"Description":"The description of doc 2",
"boost_kw1": "keyword3",
"boost_kw2": "keyword1 keyword2" }
And in the query you calculate the total score as the sum of:
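A query along these lines might look like the following sketch. It is not taken from the answer: the boost factors, the searched content fields (`Title`, `Description`), and the function name `build_tiered_query` are all assumptions, and a `bool` query with `should` clauses is used because it adds up the scores of the clauses that match.

```python
def build_tiered_query(user_query, field_boosts):
    """Build a bool/should body; matching clause scores are summed."""
    should = [
        # Ordinary relevance over the content fields (assumed names).
        {"multi_match": {"query": user_query, "fields": ["Title", "Description"]}}
    ]
    for field, boost in field_boosts.items():
        # One clause per "boosted keywords" field, each with its own weight.
        should.append({"match": {field: {"query": user_query, "boost": boost}}})
    return {"query": {"bool": {"should": should}}}


# Hypothetical weights: boost_kw1 holds the highest-priority keywords.
tiered_body = build_tiered_query("keyword1", {"boost_kw1": 100, "boost_kw2": 50})
```

Note that `boost` multiplies the clause's own match score rather than setting an absolute value, so the factors would need to be chosen large enough to dominate ordinary relevance scores.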