i have a long list of words that i put into a very simple SOLR / Lucene database. my goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (damerau) levenshtein edit distance. i understand SOLR provides such a distance for spelling suggestions.
in my SOLR schema.xml, i have configured a field type string:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
which i use to define a field
<field name='term' type='string' indexed='true' stored='true' required='true'/>
i want to search this field and have results returned according to their levenshtein edit distance. however, when i run a query like webspace~0.1 against SOLR with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:
"1582":"
1.1353534 = (MATCH) sum of:
  1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
    0.08618848 = queryWeight(term:webpage^0.8148148), product of:
      0.8148148 = boost
      13.172914 = idf(docFreq=1, maxDocs=386954)
      0.008029869 = queryNorm
    13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
      1.0 = tf(termFreq(term:webpage)=1)
      13.172914 = idf(docFreq=1, maxDocs=386954)
      1.0 = fieldNorm(field=term, doc=1581)
clearly, for my application, term frequencies, idfs and so on are meaningless, as each document only contains a single term. i tried to use the spelling suggestions component, but didn't manage to make it return the actual similarity scores.
can anybody provide hints on how to configure SOLR to perform levenshtein / jaro-winkler / n-gram searches with scores returned, and without additional factors like tf, idf, boost and so on included? is there a bare-bones configuration sample for SOLR somewhere? i find the number of options truly daunting.
If you're using a nightly build, then you can sort results based on levenshtein distance using the strdist function:
q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc
More details here and here
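As a rough illustration of how the request above could be sent from a script, here is a minimal Python sketch that builds the URL with proper escaping. The host, port and `/solr/select` path are assumptions about a default local install, and `build_strdist_query` is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import urlencode


def build_strdist_query(word, field="term", fuzz="0.1",
                        base_url="http://localhost:8983/solr/select"):
    """Build a Solr select URL that sorts fuzzy matches by strdist.

    base_url is an assumed default location; adjust it to your setup.
    """
    params = {
        # Fuzzy query from the answer above, e.g. term:webspace~0.1
        "q": f"{field}:{word}~{fuzz}",
        # Sort by edit-distance similarity, most similar first
        "sort": f'strdist("{word}",{field},edit) desc',
    }
    # urlencode escapes quotes, parentheses, commas and spaces
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_strdist_query("webspace"))
```

Encoding the parameters this way avoids the quotes and spaces in the sort clause being mangled when the URL is assembled by hand.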
Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance calculators, including Jaro-Winkler, Levenshtein, etc.
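To give an idea of what ranking a word list outside Solr might look like, here is a small pure-Python sketch. It is a stand-in for illustration and does not use the SimMetrics API; it implements plain Levenshtein distance (no Damerau transpositions), and the 0-to-1 normalization by the longer word's length is an assumption chosen so that scores are comparable across words.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank(query: str, words, limit: int = 10):
    """Return (word, similarity) pairs, most similar first."""
    scored = [(w, similarity(query, w)) for w in words]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


if __name__ == "__main__":
    for word, score in rank("webspace", ["webpage", "workspace", "webspace"]):
        print(word, round(score, 3))
```

This scans the whole list for every query, so it involves no tf, idf or boost at all. For a list of a few hundred thousand words a linear scan like this may be slow in pure Python, which is where a dedicated library would help.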