In the Lucene query syntax, I'd like to combine * and ~ in a single valid query, something like: bla~* //invalid query
Meaning: match words that begin with "bla" or with something similar to "bla".
Update: What I do now, which works for small input, is use the following (snippet of a Solr schema):
<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
In case you don't use Solr, this does the following.
Index time: index the data into a field that contains all prefixes (2 to 15 characters long) of each (short) input word.
Search time: use only the ~ operator, since the prefixes are explicitly present in the index.
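For readers without Solr at hand, here is a rough sketch of what the index-time step produces for one token. It is plain Java, not the Solr filter itself; the class name EdgeNGramSketch and the method frontGrams are made up for illustration, and the sizes 2 and 15 are taken from the schema above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics front edge n-grams, not Solr's EdgeNGramFilterFactory.
final class EdgeNGramSketch {

    // Returns every prefix of the token whose length is between minGram and maxGram.
    static List<String> frontGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [bl, bla, blan, blank, blanke, blanket]
        System.out.println(frontGrams("blanket", 2, 15));
    }
}
```

Because "bla" is stored as its own term, a fuzzy query bla~ can hit it directly, which is how the prefix part of bla~* gets covered without a wildcard.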
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the "?" symbol. To perform a multiple character wildcard search use the "*" symbol. You can also use the wildcard searches in the middle of a term.
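To make the ? and * semantics concrete, here is a small sketch that translates such a pattern into a java.util.regex pattern. This is not how Lucene implements WildcardQuery; the class WildcardSketch and its methods are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: shows what "?" and "*" mean, not Lucene's WildcardQuery.
final class WildcardSketch {

    // "?" becomes "any single character", "*" becomes "any run of characters";
    // every other character is quoted so it matches literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));    // single character wildcard
        System.out.println(matches("bla*", "blanket")); // trailing multi character wildcard
        System.out.println(matches("b*t", "blanket"));  // wildcard in the middle of a term
    }
}
```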
FuzzyQuery matches documents that contain terms similar to the search term. It is an approximate search in which similarity is measured by edit distance (the Levenshtein algorithm).
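As a reminder of what edit distance measures, here is a minimal sketch of the classic Levenshtein computation in plain Java. It is not Lucene's code (Lucene builds automata rather than comparing terms one by one), and the class name EditDistanceSketch is an assumption for this example.

```java
// Illustrative only: textbook Levenshtein distance, not Lucene's FuzzyQuery internals.
final class EditDistanceSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitute = d[i - 1][j - 1] + cost;
                int delete = d[i - 1][j] + 1;
                int insert = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("bla", "blah")); // prints 1
        System.out.println(distance("bla", "bal"));  // prints 2
    }
}
```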
The special characters + - && || ! ( ) { } [ ] ^ " ~ * ? : \ / are reserved by the Lucene query string parser, so you need to escape each one with a preceding \ if you want to use it literally. For example, f-150 should be written as f\-150, or wrapped in double quotes as "f-150".
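If you build query strings in code, the escaping can be automated. Lucene ships QueryParser.escape(String) for exactly this; the hand-rolled version below is only a sketch of the idea for when you want to see what happens, and the class name LuceneEscapeSketch is made up.

```java
// Illustrative only: in real code prefer Lucene's own QueryParser.escape(String).
final class LuceneEscapeSketch {

    // Characters that have a meaning in the query syntax
    // (& and | are included because of && and ||).
    private static final String SPECIAL = "+-!(){}[]^\"~*?:\\/&|";

    // Puts a backslash in front of every special character.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("f-150")); // prints f\-150
        System.out.println(escape("bla~*")); // prints bla\~\*
    }
}
```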
In the development trunk of Lucene (not yet released), there is code to support use cases like this, via AutomatonQuery. Warning: the APIs might/will change before it's released, but it gives you the idea.
Here is an example for your case:
// a term representative of the query, containing the field.
// the term text is not so important and only used for toString() and such
Term term = new Term("yourfield", "bla~*");
// builds a DFA that accepts all strings within an edit distance of 2 from "bla"
Automaton fuzzy = new LevenshteinAutomata("bla").toAutomaton(2);
// concatenate this DFA with another DFA equivalent to the "*" operator
Automaton fuzzyPrefix = BasicOperations.concatenate(fuzzy, BasicAutomata.makeAnyString());
// build a query, search with it to get results.
AutomatonQuery query = new AutomatonQuery(term, fuzzyPrefix);
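To see which terms such a concatenated automaton would accept, here is a plain Java sketch with no Lucene dependency. A term is accepted if some prefix of it is within the allowed edit distance of "bla"; the rest of the term is swallowed by the "any string" part. The class FuzzyPrefixSketch and its method names are assumptions for this illustration, and it checks one term at a time, which the automaton approach avoids.

```java
// Illustrative only: per-term check of the "fuzzy prefix" language, not AutomatonQuery.
final class FuzzyPrefixSketch {

    // Smallest edit distance between the pattern and any prefix of the text
    // (including the empty prefix and the whole text).
    static int minPrefixDistance(String pattern, String text) {
        int n = text.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) {
            prev[j] = j; // empty pattern versus the first j characters of text
        }
        for (int i = 1; i <= pattern.length(); i++) {
            curr[0] = i; // first i pattern characters versus the empty prefix
            for (int j = 1; j <= n; j++) {
                int cost = pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        // prev now holds the full pattern against every prefix length of text
        int best = prev[0];
        for (int j = 1; j <= n; j++) {
            best = Math.min(best, prev[j]);
        }
        return best;
    }

    // True if the term would be accepted for the given pattern and edit budget.
    static boolean matches(String pattern, String term, int maxEdits) {
        return minPrefixDistance(pattern, term) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(matches("bla", "blanket", 2)); // exact prefix
        System.out.println(matches("bla", "bxanket", 2)); // one substitution in the prefix
        System.out.println(matches("bla", "cat", 2));     // also accepted, see note below
    }
}
```

Note that an edit distance of 2 on a three-letter term is very permissive: "cat" is accepted because its prefix "ca" is two edits away from "bla". For short terms a distance of 1 is probably the more useful setting.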