
TermQuery not returning on a known search term, but WildcardQuery does

I'm hoping someone with enough insight into the inner workings of Lucene might be able to point me in the right direction =)

I'll skip most of the surrounding irrelevant code and cut right to the chase. I have a Lucene index, to which I am adding the following field (variables replaced by their literal values):

document.Add( new Field("Typenummer", "E5CEB501A244410EB1FFC4761F79E7B7", 
                        Field.Store.YES , Field.Index.UN_TOKENIZED));

Later, when I search my index (using other types of queries), I can verify that this field does indeed appear in it, for example when looping through all the fields returned by Document.GetFields():

Field: Typenummer, Value: E5CEB501A244410EB1FFC4761F79E7B7

So far so good :-)

Now the real problem is: why can I not use a TermQuery to search against this value and actually get a result?

This code produces 0 hits:

// Returns 0 hits
bq.Add( new TermQuery( new Term( "Typenummer", 
        "E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );

But if I switch this to a WildcardQuery (with no wildcards), I get the 1 hit I expect.

// returns the 1 hit I expect
bq.Add( new WildcardQuery( new Term( "Typenummer", 
        "E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );

I've checked field lengths, I've checked that I am using the same Analyzer, and so on, and I am still at square one as to why this is.

Can anyone point me in a direction I should be looking?

Mark Cassidy asked Feb 24 '12 12:02



1 Answer

I finally figured out what was going on. I'm expanding the tags for this question because, much to my surprise, it turned out to be an issue with the CMS in which this problem occurs. In summary, the problem came down to this:

  1. The field is indexed UN_TOKENIZED, meaning Lucene indexes the value exactly "as-is", as a single term
  2. The BooleanQuery I pasted snippets from gets sent to the Sitecore SearchManager inside a PreparedQuery wrapper
  3. The behaviour I expected from this was that my query (having already been prepared) would go, unaltered, to the Lucene API
  4. Turns out I was wrong. It passes through a RewriteQuery method that copies my entire set of nested queries as-is, with one exception: all the Term arguments are passed through a LowercaseStrategy()
  5. As I indexed an UPPERCASE Term (UN_TOKENIZED) and Sitecore changes the terms in my PreparedQuery to lowercase, 0 results are returned
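The mismatch in steps 1 to 5 can be sketched without any Lucene or Sitecore dependency. This is only a simulation under stated assumptions: the HashSet stands in for the term dictionary of an UN_TOKENIZED field, and RewriteTerm is a hypothetical stand-in for the lowercasing described in step 4, not the actual Sitecore code.

```csharp
using System;
using System.Collections.Generic;

// Simulation only: no Lucene.Net or Sitecore types are used here.
public static class TermLookupSketch
{
    // Stand-in for the terms of one UN_TOKENIZED field.
    // Ordinal comparison means casing matters, as it does for an exact term match.
    private static readonly HashSet<string> IndexedTerms =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "E5CEB501A244410EB1FFC4761F79E7B7"
        };

    // Hypothetical stand-in for the rewrite step that lowercases every Term.
    public static string RewriteTerm(string term)
    {
        return term.ToLowerInvariant();
    }

    // Mimics an exact-match lookup: a hit only if the exact term was indexed.
    public static int CountHits(string term)
    {
        return IndexedTerms.Contains(term) ? 1 : 0;
    }

    public static void Main()
    {
        const string value = "E5CEB501A244410EB1FFC4761F79E7B7";
        Console.WriteLine(CountHits(value));              // prints 1: term as indexed
        Console.WriteLine(CountHits(RewriteTerm(value))); // prints 0: term after lowercasing
    }
}
```

The second lookup fails for the same reason the TermQuery did: the lowercased term is simply a different term from the one that was indexed.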

I'm not going to start an argument over whether this is a "by design" or a "by design flaw" implementation of the Lucene wrapper API. I'll just note that rewriting my query when using the PreparedQuery overload is... to me... unexpected ;-)

A further lesson from this: indexing the field as TOKENIZED eliminates the problem too, since the StandardAnalyzer lowercases all tokens by default.
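Why index-time lowercasing helps can be shown with the same kind of dependency-free simulation. AnalyzeToken below is a hypothetical stand-in for an analyzer that lowercases tokens; it is not the real StandardAnalyzer and makes no claim about how that analyzer splits text.

```csharp
using System;
using System.Collections.Generic;

// Simulation only: models index-time lowercasing, not the real StandardAnalyzer.
public static class LowercasedIndexSketch
{
    // Hypothetical stand-in for an analysis step that lowercases a token before indexing.
    public static string AnalyzeToken(string rawValue)
    {
        return rawValue.ToLowerInvariant();
    }

    // Builds a one-term "index" from the analyzed value.
    public static HashSet<string> BuildIndex(string rawValue)
    {
        var terms = new HashSet<string>(StringComparer.Ordinal);
        terms.Add(AnalyzeToken(rawValue));
        return terms;
    }

    public static void Main()
    {
        const string value = "E5CEB501A244410EB1FFC4761F79E7B7";
        HashSet<string> index = BuildIndex(value);

        // A query term that was lowercased on the way in now matches.
        Console.WriteLine(index.Contains(value.ToLowerInvariant())); // prints True

        // The flip side: the original uppercase term no longer matches.
        Console.WriteLine(index.Contains(value)); // prints False
    }
}
```

Once both sides are lowercased the lookup succeeds, but a raw uppercase term that bypasses the lowercasing would then miss, so the index and the query path need to agree on casing.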

Mark Cassidy answered Oct 10 '22 23:10
