Multi-word synonym search in Solr

Tags: solr, synonym

I'm trying to use a synonym filter to search for a phrase.

peter=> spider man, spiderman, Mary Jane, .....

I use the default configuration. When I put these synonyms into synonym.txt and restart Solr, it seems to work only partially: it searches for "spider", "man", "spiderman", "Mary" and "Jane" as separate terms, but what I want to search for are the meaningful combinations, like "spider man", "Mary Jane" and "spiderman".

Kuan asked Apr 16 '15 16:04


1 Answer

Yes, sadly this is a well-known problem, caused by how the Solr query parser splits the query on whitespace before analyzing it. Instead of seeing "spider" followed by "man" in the token stream, the analyzer sees each word on its own: just "spider" with nothing before or after it, and just "man" with nothing before or after it.

This is because most Solr query parsers treat a space as basically an "OR". A search for "spider man" is handled as "spider OR man", instead of the parser looking at the full text, analyzing it to generate synonyms, and then generating a query from that.
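To make the difference concrete, here is a toy sketch in Python. It is not Solr code: the `SYNONYMS` table and the functions `expand`, `split_then_analyze` and `analyze_whole` are made up for illustration. It only shows why a multi-word synonym can never match when each whitespace-separated word is analyzed on its own.

```python
# Toy model only: names and data are hypothetical, not Solr APIs.
# Keys are token tuples so that multi-word synonym entries can exist.
SYNONYMS = {
    ("spider", "man"): ["spiderman"],
    ("spiderman",): ["spider man"],
}


def expand(tokens):
    """Greedy longest-match synonym expansion over a list of tokens."""
    out = []
    i = 0
    max_len = max(len(key) for key in SYNONYMS)
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.append(" ".join(key))
                out.extend(SYNONYMS[key])
                i += n
                break
        else:
            # No synonym entry starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def split_then_analyze(query):
    """Split on whitespace first, then analyze each word in isolation."""
    terms = []
    for word in query.lower().split():
        terms.extend(expand([word]))
    return terms


def analyze_whole(query):
    """Analyze the full token stream, so adjacent words are visible."""
    return expand(query.lower().split())


print(split_then_analyze("spider man"))  # ['spider', 'man']
print(analyze_whole("spider man"))       # ['spider man', 'spiderman']
```

In the first call the two-word entry is never seen, because "spider" and "man" arrive separately. In the second call the whole stream is analyzed, so the entry matches.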

For more background, there's this blog post

There are a number of solutions to this problem, including the following:

  • hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a black box, and I've found it can generate complex query forms that cause odd performance and relevance bugs.
  • Lucidworks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and that can have synonym expansion applied.
  • OpenSource Connections' Match query parser. Searches a single field using a query-specified analyzer that is run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
  • Rene Kriegler's Querqy. Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to a non-multiterm form.
  • Roll your own. Learn to write your own query parser plugin and handle the problem however you want.
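As a rough illustration of the "identify key phrases before the parser splits them" idea behind several of these options, here is a minimal client-side sketch in Python. It is not how any of the plugins above is implemented, and it runs in the application rather than as a Solr query parser plugin. The `KEY_PHRASES` list and the `autophrase` function are hypothetical names. The sketch wraps known multi-word phrases in double quotes, which the standard Solr query syntax treats as a phrase query.

```python
import re

# Hypothetical list of key phrases; in practice it might be derived
# from the multi-word entries in your synonyms file.
KEY_PHRASES = ["spider man", "mary jane"]


def autophrase(query, phrases=KEY_PHRASES):
    """Wrap each known multi-word phrase in double quotes.

    Phrases that are already quoted, or that are part of a longer
    word, are left untouched.
    """
    result = query
    # Longest phrases first, so a longer phrase wins over a shorter one.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = re.compile(
            r'(?<![\w"])' + re.escape(phrase) + r'(?![\w"])',
            re.IGNORECASE,
        )
        result = pattern.sub(lambda m: '"' + m.group(0) + '"', result)
    return result


print(autophrase("spider man comics"))
print(autophrase("Mary Jane and spider man"))
```

The rewritten string would then be sent to Solr as the query. Whether synonym expansion then behaves as desired still depends on the field's analyzer configuration, so treat this as a starting point rather than a complete fix.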
Doug T. answered Oct 02 '22 16:10