
Grouping match_phrase search results by match text in Elasticsearch

Given a phrase match query like this:

{
    'match_phrase': {
        'text.english': {
            'query': "The fox jumped over the wall",
            'phrase_slop': 4,
        }
    }
}

Is there a way I can group results by the exact match?

So if I have 1 document with text.english containing "The quick fox jumps over the small wall" and 3 documents containing "The lazy fox jumped over the big wall", I end up with those two groups of results.

I'm OK with running multiple queries and doing some processing outside of ES, but I need a solution that performs reasonably over a large set of documents. Ideally I'm hoping there's a way to do this using aggregations that I've missed.

The best solution I've come up with is to run the query above with highlights, parse out all of the highlights from all of the results, and group them based on highlight content. This is fine for very small result sets; however, over a result set of 1000+ documents it is prohibitively slow.
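For reference, a minimal sketch of the request body that approach implies, written as a Python dict like the query above. The helper name `with_highlight`, the `size` value, and the choice of a single fragment per document are assumptions made for illustration, not settings I have tuned:

```python
def with_highlight(query, field, size=1000):
    """Wrap an existing query in a search body that also requests highlights.

    `size` and `number_of_fragments` are illustrative assumptions.
    """
    return {
        "size": size,
        "_source": False,  # skip the stored text; only the highlights are needed
        "query": query,
        "highlight": {
            "fields": {
                # One fragment per document, containing the matched terms.
                field: {"number_of_fragments": 1},
            }
        },
    }


phrase_query = {
    "match_phrase": {
        "text.english": {
            "query": "The fox jumped over the wall",
            "phrase_slop": 4,
        }
    }
}

body = with_highlight(phrase_query, "text.english")
```

Each hit then carries a `highlight` entry for `text.english`, and the grouping happens client-side on those fragments.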

EDIT: Maybe I can make this a bit clearer. If I have sample documents with the following values:

  1. "The quick fox jumps over the small wall. Blah blah blah many pages of unrelated text."
  2. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."
  3. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."
  4. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."

I want to be able to group my results as follows with query text "The fox jumped over the wall":

  • "The quick fox jumps over the small wall" - Document 1
  • "The lazy fox jumped over the big wall" - Documents 2, 3, 4

Cole Maclean asked Oct 23 '15 14:10


1 Answer

In my opinion, highlighting is the only option, because it is the only way Elasticsearch will show which parts of the text matched. And in your case, you want to group documents based on what matched.

If the text were shorter (say, a few words), a more involved solution might be to split the text into shingles and somehow group on those phrases... maybe.

But for pages of text, I think the only option is to use highlighting and perform additional steps afterwards to group the highlighted parts.
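As a rough sketch of those additional steps, assuming the default `<em>`/`</em>` highlight tags and hits shaped like the `hits.hits` entries of a search response: the code below takes the span from the first to the last highlighted term in a fragment, strips the tags, and groups document ids by that text. The function names and the sample hits are made up for illustration, and the lowercase/whitespace normalization is only one possible choice of grouping key.

```python
import re
from collections import defaultdict

# Greedy match: from the first <em> to the last </em> in a fragment.
EM_SPAN = re.compile(r"<em>.*</em>", re.DOTALL)
EM_TAG = re.compile(r"</?em>")


def matched_text(fragment):
    """Return the normalized text between the first and last highlighted terms."""
    match = EM_SPAN.search(fragment)
    if match is None:
        return None
    text = EM_TAG.sub("", match.group(0))
    return " ".join(text.split()).lower()


def group_hits_by_highlight(hits, field="text.english"):
    """Map each matched text to the list of document ids that contain it."""
    groups = defaultdict(list)
    for hit in hits:
        for fragment in hit.get("highlight", {}).get(field, []):
            key = matched_text(fragment)
            if key is not None:
                groups[key].append(hit["_id"])
                break  # use only the first highlighted fragment per document
    return dict(groups)


# Hypothetical hits; stop words such as "the" are typically left unhighlighted.
lazy = "The lazy <em>fox</em> <em>jumped</em> <em>over</em> the big <em>wall</em>. Blah blah"
quick = "The quick <em>fox</em> <em>jumps</em> <em>over</em> the small <em>wall</em>. Blah blah"
sample_hits = [
    {"_id": "1", "highlight": {"text.english": [quick]}},
    {"_id": "2", "highlight": {"text.english": [lazy]}},
    {"_id": "3", "highlight": {"text.english": [lazy]}},
    {"_id": "4", "highlight": {"text.english": [lazy]}},
]

groups = group_hits_by_highlight(sample_hits)
```

Because the key starts at the first highlighted term, a leading unhighlighted word (here "The") is not part of it. This does not remove the cost of highlighting itself on the Elasticsearch side; it only covers the grouping step.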

Andrei Stefan answered Sep 17 '22 14:09