
Lucene.Net Underscores causing token split

Tags: tsql, lucene

I've scripted the tables, views, and stored procedures of an MS SQL Server database into a directory structure that I am then indexing with Lucene.Net. Most of my table, view, and procedure names contain underscores.

I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I receive hits for tIr and for InvoiceBtnWtn01, rather than only for tIr_InvoiceBtnWtn01.

I think the issue is that the tokenizer splits on _ (underscore) because it treats it as punctuation.

Is there a (simple) way to remove underscores from the punctuation list, or is there another analyzer that I should be using for SQL and programming languages?

automatic asked Dec 01 '10 15:12


1 Answer

Yes, the StandardAnalyzer splits on underscores; WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to assign a different analyzer to each field, so you can keep the standard analyzer's behavior for everything except the fields holding table and column names.
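
As a rough sketch of that per-field setup, assuming Lucene.Net 2.9 and a hypothetical field called "name" that holds the table, view, and procedure names (AnalyzerFactory is also just an illustrative name), the wrapper might be built like this:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

static class AnalyzerFactory
{
    // "name" is a hypothetical field holding table, view, and procedure names.
    public static Analyzer Create()
    {
        // Default analyzer for every field that has no explicit override.
        var wrapper = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));

        // Split the name field on whitespace only, so underscores survive.
        wrapper.AddAnalyzer("name", new WhitespaceAnalyzer());
        return wrapper;
    }
}
```

The same analyzer should be handed to both the IndexWriter and the QueryParser; otherwise the query text is tokenized differently from the indexed text and the names will not match.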

WhitespaceAnalyzer only splits on whitespace, though. It won't lowercase your tokens, for example. So you might want to write your own analyzer that combines WhitespaceTokenizer and LowerCaseFilter. (LowerCaseTokenizer also lowercases, but it divides text at every non-letter character, so it would still break names at underscores and digits.)

EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):

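using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Chains together the standard tokenizer, standard filter, and lowercase filter.
// Note: StandardTokenizer is the piece that applies the standard splitting rules;
// swap in WhitespaceTokenizer if tokens should only break on whitespace.
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        StandardFilter standardFilter = new StandardFilter(baseTokenizer);
        LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
        return lcFilter;
    }
}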
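
For comparison, here is a minimal sketch of the WhitespaceTokenizer and LowerCaseFilter combination suggested earlier, again assuming Lucene.Net 2.9. The class name WhitespaceLowerCaseAnalyzer is made up for illustration:

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical name; splits on whitespace only, then lowercases each token.
class WhitespaceLowerCaseAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // WhitespaceTokenizer leaves underscores and digits inside a token;
        // LowerCaseFilter then makes matching case-insensitive.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```

With this chain, tIr_InvoiceBtnWtn01 is indexed as the single token tir_invoicebtnwtn01. One caveat: because only whitespace separates tokens, punctuation adjacent to a name in a script (brackets, commas, parentheses) stays attached to it, so some extra filtering may be needed for raw SQL text.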
Xodarap answered Oct 12 '22 21:10