I'm using Lucene.net, but I am tagging this question for both .NET and Java versions because the API is the same and I'm hoping there are solutions on both platforms.
I'm sure other people have addressed this issue, but I haven't been able to find any good discussions or examples.
By default, Lucene is very picky about query syntax. For example, I just got the following error:
[ParseException: Cannot parse 'hi there!': Encountered "<EOF>" at line 1, column 9.
Was expecting one of:
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
]
Lucene.Net.QueryParsers.QueryParser.Parse(String query) +239
What is the best way to prevent ParseExceptions when processing queries from users? It seems to me that the most usable search interface is one that always executes a query, even if it might be the wrong query.
It seems that there are a few possible, and complementary, strategies:
I don't really have any great ideas about how to do any of those strategies. Has anyone else addressed this issue? Are there any "simple" or "graceful" parsers that I don't know about?
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the "?" symbol. To perform a multiple character wildcard search use the "*" symbol. You can also use the wildcard searches in the middle of a term.
You can't search for special characters in Lucene Search. These are + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ / @.
Lucene is a query language that can be used to filter messages in your PhishER inbox. A query written in Lucene can be broken down into three parts: Field The ID or name of a specific container of information in a database. If a field is referenced in a query string, a colon ( : ) must follow the field name.
You can embed Boolean operators in a query string to improve the precision of a match. The full syntax supports text operators in addition to character operators. Always specify text boolean operators (AND, OR, NOT) in all caps.
Yo can make Lucene ignore the special characters by sanitizing the query with something like
query = QueryParser.Escape(query)
If you do not want your users to ever use advanced syntax in their queries, you can do this always.
If you want your users to use advanced syntax but you also want to be more forgiving with the mistakes you should only sanitize after a ParseException has occured.
Well, the easiest thing to do would be to give the raw form of the query a shot, and if that fails, fall back to cleaning it up.
Query safe_query_parser(QueryParser qp, String raw_query)
throws ParseException
{
Query q;
try {
q = qp.parse(raw_query);
} catch(ParseException e) {
q = null;
}
if(q==null)
{
String cooked;
// consider changing this "" to " "
cooked = raw_query.replaceAll("[^\w\s]","");
q = qp.parse(cooked);
}
return q;
}
This gives the raw form of the user's query a chance to run, but if parsing fails, we strip everything except letters, numbers, spaces and underscores; then we try again. We still risk throwing ParseException, but we've drastically reduced the odds.
You could also consider tokenizing the user's query yourself, turning each token into a term query, and glomming them together with a BooleanQuery. If you're not really expecting your users to take advantage of the features of the QueryParser, that would be the best bet. You'd be completely(?) robust, and users could search for whatever funny characters will make it through your analyzer
FYI... Here is the code I am using for .NET
private Query GetSafeQuery(QueryParser qp, String query)
{
Query q;
try
{
q = qp.Parse(query);
}
catch(Lucene.Net.QueryParsers.ParseException e)
{
q = null;
}
if(q==null)
{
string cooked;
cooked = Regex.Replace(query, @"[^\w\.@-]", " ");
q = qp.Parse(cooked);
}
return q;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With