
Using Lucene to search for email addresses

I want to use Lucene (in particular, Lucene.NET) to search for email address domains.

E.g. I want to search for "@gmail.com" to find all emails sent to a gmail address.

Running a Lucene query for "*@gmail.com" results in an error, because an asterisk cannot appear at the start of a query. Running a query for "@gmail.com" doesn't return any matches, because a full address such as "user@gmail.com" is indexed as a single word, and you cannot search for just part of a word.

How can I do this?

Judah Gabriel Himango asked Aug 20 '08 22:08


3 Answers

No one gave a satisfactory answer, so we started poking around Lucene documentation and discovered we can accomplish this using custom Analyzers and Tokenizers.

The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. Once you do this, a search for "@gmail.com" will return all gmail addresses, because it's seen as a separate word thanks to the Tokenizer we just created.

Here's the source code; it's actually very simple:

class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Make whitespace characters and the @ symbol be indicators of new words.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}


internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}

That's it! Now you just need to rebuild your index and do all searches using this new Analyzer. For example, to write documents to your index:

IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);

Searches should use the same analyzer as well:

IndexSearcher searcher = new IndexSearcher(indexDirectory);
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
Hits hits = searcher.Search(query);
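
To see why the "@gmail.com" query matches once the index is rebuilt, here is a minimal standalone sketch of the same splitting rule, written without Lucene so it can be run on its own. The class name AtSymbolTokenizerSketch is hypothetical, and the sketch ignores other CharTokenizer behavior such as maximum token length.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class AtSymbolTokenizerSketch
{
    // Mirrors the IsTokenChar rule above: whitespace and '@' end the current token.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c) || c == '@')
            {
                // Separator: emit the token collected so far, if any.
                if (current.Length > 0)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(c);
            }
        }
        // Emit the trailing token.
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

Under this rule, Tokenize("user@gmail.com") yields the tokens "user" and "gmail.com", and Tokenize("@gmail.com") yields the single token "gmail.com", so the query term and the indexed term line up.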

Judah Gabriel Himango answered Oct 24 '22 01:10


I see you have your solution, but I would have avoided the custom analyzer by adding a field called email_domain to the documents you're indexing and storing the parsed-out domain of the email address in it. It might sound silly, but the storage cost is minimal. If you want to get fancier, say because some domain has many subdomains, you could instead store the reversed domain in the field: com.gmail, com.company.department, or ae.eim. You could then find all United Arab Emirates addresses with a prefix query of 'ae.'
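
A minimal sketch of how the email_domain value and its reversed form might be computed before indexing. The class and method names are hypothetical, and lowercasing the domain is an assumption on my part rather than something the answer specifies.

```csharp
using System;

static class EmailDomainSketch
{
    // Returns the text after the last '@' (lowercased; an assumption),
    // or null when the input has no domain part.
    public static string ExtractDomain(string email)
    {
        int at = email.LastIndexOf('@');
        if (at < 0 || at == email.Length - 1)
        {
            return null;
        }
        return email.Substring(at + 1).ToLowerInvariant();
    }

    // Reverses the dot-separated labels, e.g. "department.company.com"
    // becomes "com.company.department".
    public static string ReverseDomain(string domain)
    {
        string[] labels = domain.Split('.');
        Array.Reverse(labels);
        return string.Join(".", labels);
    }
}
```

For example, ReverseDomain(ExtractDomain("someone@eim.ae")) gives "ae.eim", which a prefix query of 'ae.' would then match.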

dlamblin answered Oct 24 '22 00:10


There is also setAllowLeadingWildcard on the query parser.

But be careful. This can be very expensive in terms of performance (that's why it is disabled by default). In some cases it might be an easy solution, but I would also prefer a custom Tokenizer, as described by Judah Himango.
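
A rough illustration of where that cost comes from, using a plain sorted list as a simplified stand-in for an index's term dictionary. This is not Lucene code: the class name is hypothetical, and the sketch assumes the terms are unique and sorted in ordinal order.

```csharp
using System;
using System.Collections.Generic;

static class WildcardCostSketch
{
    // Trailing wildcard ("gmail*"): binary-search to the first candidate,
    // then read forward only while terms still share the prefix.
    public static List<string> PrefixMatch(List<string> sortedTerms, string prefix)
    {
        int i = sortedTerms.BinarySearch(prefix, StringComparer.Ordinal);
        if (i < 0)
        {
            i = ~i; // index of the first term greater than the prefix
        }
        var hits = new List<string>();
        while (i < sortedTerms.Count && sortedTerms[i].StartsWith(prefix, StringComparison.Ordinal))
        {
            hits.Add(sortedTerms[i]);
            i++;
        }
        return hits;
    }

    // Leading wildcard ("*@gmail.com"): the sort order gives no place to
    // seek to, so every term has to be examined.
    public static List<string> SuffixMatch(List<string> sortedTerms, string suffix)
    {
        var hits = new List<string>();
        foreach (string term in sortedTerms)
        {
            if (term.EndsWith(suffix, StringComparison.Ordinal))
            {
                hits.Add(term);
            }
        }
        return hits;
    }
}
```

The prefix lookup touches only the matching run of terms, while the suffix lookup scales with the total number of distinct terms in the field.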

Markus answered Oct 24 '22 00:10