Lucene query: bla~* (match words that start with something fuzzy), how?

In the Lucene query syntax, I'd like to combine * and ~ in a valid query, similar to: bla~* (which is an invalid query)

Meaning: Please match words that begin with "bla" or something similar to "bla".

Update: What I do now, which works for small input, is to use the following (snippet of a Solr schema):

<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

In case you don't use Solr, this does the following.

Index time: index the data into a field that contains the prefixes (2 to 15 characters long) of my (short) input.

Search time: use only the ~ operator, since the prefixes are explicitly present in the index.
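For readers without Solr at hand, the index-time step can be pictured with a small plain-Java sketch. It is not the Solr filter itself, only an illustration of which front-anchored n-grams a token would contribute under the same bounds (2 and 15); the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class EdgeNGramSketch {
    // Hypothetical helper: returns the prefixes of `token` whose length
    // lies between minGram and maxGram (inclusive), shortest first.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same bounds as in the schema snippet: minGramSize=2, maxGramSize=15.
        System.out.println(edgeNGrams("blackbird", 2, 15));
    }
}
```

Because "bla" is itself one of the indexed terms for "blackbird", a plain fuzzy query such as bla~ can match it without any wildcard.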

asked Apr 13 '10 16:04

Pimin Konstantin Kefaloukos

1 Answer

In the development trunk of Lucene (not yet a release), there is code to support use cases like this, via AutomatonQuery. Warning: the APIs might (and probably will) change before it's released, but it gives you the idea.

Here is an example for your case:

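// Imports assume the trunk package layout at the time of writing; like the
// rest of this API, they may change before release.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.AutomatonQuery;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.BasicAutomata;
import org.apache.lucene.util.automaton.BasicOperations;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

// a term representative of the query, containing the field.
// the term text is not so important and only used for toString() and such
Term term = new Term("yourfield", "bla~*");

// builds a DFA that accepts all strings within an edit distance of 2 from "bla"
Automaton fuzzy = new LevenshteinAutomata("bla").toAutomaton(2);

// concatenate this DFA with another DFA equivalent to the "*" operator
Automaton fuzzyPrefix = BasicOperations.concatenate(fuzzy, BasicAutomata.makeAnyString());

// build a query, search with it to get results.
AutomatonQuery query = new AutomatonQuery(term, fuzzyPrefix);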
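To make clear which terms such a concatenated automaton would accept, here is a brute-force sketch in plain Java with no Lucene dependency. It is not how Lucene evaluates the query; it only restates the matching rule ("some prefix of the term is within N edits of the pattern") as a standard edit-distance table. The class and method names are hypothetical.

```java
class FuzzyPrefixSketch {
    // True if some prefix of `word` is within `maxEdits` Levenshtein edits
    // (insert, delete, substitute) of `pattern`.
    static boolean matchesFuzzyPrefix(String pattern, String word, int maxEdits) {
        int m = pattern.length();
        // prev[i] = edit distance between pattern[0..i) and the word prefix read so far
        int[] prev = new int[m + 1];
        for (int i = 0; i <= m; i++) {
            prev[i] = i; // against the empty prefix: i deletions
        }
        int best = prev[m];
        for (int j = 1; j <= word.length(); j++) {
            int[] cur = new int[m + 1];
            cur[0] = j;
            for (int i = 1; i <= m; i++) {
                int cost = pattern.charAt(i - 1) == word.charAt(j - 1) ? 0 : 1;
                cur[i] = Math.min(Math.min(cur[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
            }
            // cur[m] is the distance from the full pattern to word[0..j)
            best = Math.min(best, cur[m]);
            prev = cur;
        }
        return best <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(matchesFuzzyPrefix("bla", "blackbird", 2));
        System.out.println(matchesFuzzyPrefix("bla", "clackbird", 2));
        System.out.println(matchesFuzzyPrefix("bla", "xyzzy", 2));
    }
}
```

Note that an edit distance of 2 on a three-letter pattern is very permissive (for example, "table" qualifies through its prefix "ta"), so a smaller distance may be preferable for short inputs.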

answered Oct 22 '22 20:10

Robert Muir

Tags: lucene, wildcard, fuzzy-search