 

Solr search with/without hyphen

Tags: solr

I'm having trouble getting relevant search results for words written with and without a hyphen. I created two documents, one with "wifi" and one with "wi-fi" in the "text" field.

When searching for "wifi", both documents appear in the search results, which is fine. When searching for "wi-fi", only the document with "wi-fi" appears in the search results.

Here is my configuration:

<field name="text" type="text" indexed="true" stored="true" omitNorms="true" />

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Here is the result of the analysis: https://www.evernote.com/shard/s7/sh/f1bab83a-7fd5-4bf3-9e67-239ea0c71441/98b1103577638734fb9335f755591b82/deep/0/Solr-Admin-(jeanfrancoiscote.egzakt.com).png

Here is the debug output of the query when searching for "wi-fi". I can't figure out why it doesn't find both documents:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="debugQuery">true</str>
    <str name="indent">true</str>
    <str name="q">wi-fi</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <int name="id">1869</int>
    <str name="route">@sujet_simple?sujet_id=1869&amp;slug=wi-fi</str>
    <str name="name">Wi-fi</str>
    <str name="text">&lt;p&gt;
    Wi-fi&lt;/p&gt;
</str>
    <long name="_version_">1493472450933948416</long></doc>
</result>
<lst name="debug">
  <str name="rawquerystring">wi-fi</str>
  <str name="querystring">wi-fi</str>
  <str name="parsedquery">MultiPhraseQuery(text:"(wi-fi wi) (fi wifi)")</str>
  <str name="parsedquery_toString">text:"(wi-fi wi) (fi wifi)"</str>
  <lst name="explain">
    <str name="1869">
30.33298 = (MATCH) weight(text:"(wi-fi wi) (fi wifi)" in 0) [DefaultSimilarity], result of:
  30.33298 = score(doc=0,freq=1.0 = phraseFreq=1.0
), product of:
    0.99999994 = queryWeight, product of:
      30.332981 = idf(), sum of:
        7.684612 = idf(docFreq=1, maxDocs=1600)
        7.684612 = idf(docFreq=1, maxDocs=1600)
        7.684612 = idf(docFreq=1, maxDocs=1600)
        7.2791467 = idf(docFreq=2, maxDocs=1600)
      0.032967415 = queryNorm
    30.332981 = fieldWeight in 0, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = phraseFreq=1.0
      30.332981 = idf(), sum of:
        7.684612 = idf(docFreq=1, maxDocs=1600)
        7.684612 = idf(docFreq=1, maxDocs=1600)
        7.684612 = idf(docFreq=1, maxDocs=1600)
        7.2791467 = idf(docFreq=2, maxDocs=1600)
      1.0 = fieldNorm(doc=0)
</str>
  </lst>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">1.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="query">
        <double name="time">0.0</double>
      </lst>
      <lst name="facet">
        <double name="time">0.0</double>
      </lst>
      <lst name="mlt">
        <double name="time">0.0</double>
      </lst>
      <lst name="highlight">
        <double name="time">0.0</double>
      </lst>
      <lst name="stats">
        <double name="time">0.0</double>
      </lst>
      <lst name="debug">
        <double name="time">0.0</double>
      </lst>
    </lst>
    <lst name="process">
      <double name="time">1.0</double>
      <lst name="query">
        <double name="time">0.0</double>
      </lst>
      <lst name="facet">
        <double name="time">0.0</double>
      </lst>
      <lst name="mlt">
        <double name="time">0.0</double>
      </lst>
      <lst name="highlight">
        <double name="time">0.0</double>
      </lst>
      <lst name="stats">
        <double name="time">0.0</double>
      </lst>
      <lst name="debug">
        <double name="time">1.0</double>
      </lst>
    </lst>
  </lst>
</lst>
</response>

Thanks for your help.

Tiois asked Sep 28 '22 17:09


1 Answer

You need to adjust the query-side analysis in your schema. debugQuery=true and the Solr Analysis tools are your friends for finding these kinds of errors.

With your configuration, the search for wifi produces the following query:

wifi
"parsedquery_toString": "text:wifi",

and for wi-fi:

wi-fi
"parsedquery_toString": "text:\"(wi-fi wi) (fi wifi)\"",

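The query analyzer in this configuration turns wi-fi into a phrase query (autoGeneratePhraseQueries="true" is set on the field type): one of wi-fi or wi must be followed, at the next position, by one of fi or wifi. The document containing only wifi is indexed as a single token, so it cannot satisfy this two-position phrase and is not matched.

If we change the WordDelimiterFilterFactory in the query analyzer so that it does not produce word parts: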

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />

we get the following queries generated for wifi:

"parsedquery_toString": "text:wifi",

and for wi-fi:

"parsedquery_toString": "text:wi-fi text:wifi"

which match our indexed terms for wi-fi and wifi, as shown by the Analysis tool:

wi-fi, wi, fi, wifi
wifi

Note: text is our default field in this example
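
To put the change in context, here is a sketch of how the field type from the question might look with the adjusted filter. It assumes the change belongs only in the `<analyzer type="query">` block, since the indexed terms listed above still contain the word parts wi and fi. Everything else is copied unchanged from the question.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- Index analyzer: unchanged from the question, so "wi-fi" is still indexed as wi-fi, wi, fi, wifi -->
  <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- Query analyzer: assumption - only generateWordParts differs here (0 instead of 1),
       so "wi-fi" is expected to yield the original token and the catenated "wifi", not the parts -->
  <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because only the query analyzer changes in this sketch, the existing documents should not need to be re-indexed. Reloading the core and re-running the query with debugQuery=true should be enough to check that the parsed query now reads text:wi-fi text:wifi.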

David George answered Oct 21 '22 12:10