Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Solr/Lucene query lemmatization with context

I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it woks nice at the index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (words before or after) to the lemmatizer.

For example the phrase pila vodu is analyzed differently at index time than at query time. It uses the ambiguous word pila, which could mean pila (saw e.g. chainsaw) or pít (the past tense of the verb "to drink").

pila vodu ->

  • Index time: pít voda
  • Query time: pila voda

.. so the word pila is not found and not highlighted in a document snippet.

This behaviour is documented at the solr wiki (quoted bellow) and I can confirm it by debugging my code (only isolated strings "pila" and "vodu" are passed to the lemmatizer).

... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, ...

So my question is:

Is it possible to somehow change, configure or adapt the query parser so the lemmatizer would see the whole query string, or at least some context of individual words? I would like to have a solution also for different solr query parsers like dismax or edismax.

I know that there is no such issue with phrase queries like "pila vodu" (quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").


Edit - trying to explain / answer following question (thank you @femtoRgon):

If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?

For sure it would be better to analyze only terms coming together. For example at the indexing time, the lemmatizer detects sentences in the input text and it analyzes together only words from a single sentence. But how to achieve a similar thing at the query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser, would I have to implement them again in case of my own parser?

The idea behind is in fact a bit deeper because the lemmatizer is doing word-sense-disambiguation even for words that has the same lexical base. For example the word bow has about 7 different senses in English (see at wikipedia) and the lemmatizer is distinguishing such senses. So I would like to exploit this potential to make searches more precise -- to return only documents containing the word bow in the concrete sense required by the query. So my question could be extended to: How to get the correct <lemma;sense>-pair for a query term? The lemmatizer is very often able to assign the correct sense if the word is presented in its common context, but it has no chance when there is no context.

like image 226
dedek Avatar asked Oct 04 '16 10:10

dedek


1 Answers

Finally, I implemented my own query parser.

It wasn't that difficult thanks to the edismax sources as a guide and a reference implementation. I could easily compare my parser results with the results of edismax...

Solution :

First, I analyze the whole query string together. This gives me the list of "tokens".

There is a little clash with stop words - it is not that easy to get tokens for stop words as they are omitted by the analyzer, but you can detect them from PositionIncrementAttribute.

From "tokens" I construct the query in the same way as edismax do (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).

like image 184
dedek Avatar answered Oct 17 '22 02:10

dedek