
How to make the Lucene QueryParser more forgiving?

I'm using Lucene.net, but I am tagging this question for both .NET and Java versions because the API is the same and I'm hoping there are solutions on both platforms.

I'm sure other people have addressed this issue, but I haven't been able to find any good discussions or examples.

By default, Lucene is very picky about query syntax. For example, I just got the following error:

[ParseException: Cannot parse 'hi there!': Encountered "<EOF>" at line 1, column 9.
Was expecting one of:
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    ]
   Lucene.Net.QueryParsers.QueryParser.Parse(String query) +239

What is the best way to prevent ParseExceptions when processing queries from users? It seems to me that the most usable search interface is one that always executes a query, even if it might be the wrong query.

It seems that there are a few possible, and complementary, strategies:

  • "Clean" the query prior to sending it to the QueryParser
  • Handle exceptions gracefully
    • Show an intelligent error message to the user
    • Perhaps execute a simpler query, leaving off the erroneous bit

I don't really have any great ideas about how to do any of those strategies. Has anyone else addressed this issue? Are there any "simple" or "graceful" parsers that I don't know about?

Winston Fassett asked Nov 04 '08 19:11



3 Answers

You can make Lucene ignore the special characters by sanitizing the query with something like:

query = QueryParser.Escape(query)

If you do not want your users to ever use advanced syntax in their queries, you can do this always.

If you want your users to be able to use advanced syntax but also want to be more forgiving of their mistakes, you should sanitize only after a ParseException has occurred.
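To make the escape-on-failure idea concrete, here is a minimal sketch in plain Java. It deliberately does not call Lucene: SyntaxParser, escapeSyntax and parseForgiving are hypothetical names, and escapeSyntax only approximates what QueryParser.Escape does (putting a backslash in front of syntax characters). The character list is an assumption and may differ between Lucene versions, so in real code you would call the library method instead.

```java
/** Sketch only: none of these names come from Lucene. */
final class ForgivingParse {

    /** Hypothetical stand-in for a query parser that may reject its input. */
    interface SyntaxParser<Q> {
        Q parse(String text) throws Exception;
    }

    // Assumed set of query-syntax characters; the real set depends on the Lucene version.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every assumed syntax character with a backslash. */
    static String escapeSyntax(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Tries the raw text first; escapes and retries only if parsing fails. */
    static <Q> Q parseForgiving(SyntaxParser<Q> parser, String raw) throws Exception {
        try {
            return parser.parse(raw);
        } catch (Exception e) {
            // The second attempt treats the whole input as literal text.
            return parser.parse(escapeSyntax(raw));
        }
    }
}
```

With the input from the question, escapeSyntax("hi there!") produces hi there\! so the trailing exclamation mark is no longer read as an operator. Users who type valid advanced syntax never hit the fallback.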

ljorquera answered Sep 22 '22 18:09


Well, the easiest thing to do would be to give the raw form of the query a shot, and if that fails, fall back to cleaning it up.

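Query safe_query_parser(QueryParser qp, String raw_query)
  throws ParseException
{
  try {
    // First give the user's query a chance to run exactly as typed.
    return qp.parse(raw_query);
  } catch (ParseException e) {
    // Fall back: strip everything except word characters and whitespace.
    // Note the doubled backslashes, which a Java string literal requires.
    // consider changing this "" to " "
    String cooked = raw_query.replaceAll("[^\\w\\s]", "");
    // This second parse can still throw, so the caller must handle that.
    return qp.parse(cooked);
  }
}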

This gives the raw form of the user's query a chance to run, but if parsing fails, we strip everything except letters, numbers, spaces and underscores; then we try again. We still risk throwing ParseException, but we've drastically reduced the odds.
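To see what the cleanup step does to a query, and why the choice between "" and " " matters, here is a small standalone sketch. StripDemo and strip are made-up names for illustration, and nothing here touches Lucene.

```java
/** Sketch only: shows the effect of the fallback regex on its own. */
final class StripDemo {

    /** Replaces every character that is not a word character or whitespace. */
    static String strip(String raw, String replacement) {
        return raw.replaceAll("[^\\w\\s]", replacement);
    }
}
```

strip("hi there!", "") gives hi there, which is what we want. strip("foo-bar", "") gives foobar, gluing two words into one, while strip("foo-bar", " ") gives foo bar, which is usually closer to what the user meant. One more caveat: in Java, \w matches only ASCII letters, digits and underscore by default, so accented letters are stripped as well (café becomes caf) unless you compile the pattern with Pattern.UNICODE_CHARACTER_CLASS.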

You could also consider tokenizing the user's query yourself, turning each token into a term query, and glomming them together with a BooleanQuery. If you're not really expecting your users to take advantage of the features of the QueryParser, that would be the best bet. You'd be completely(?) robust, and users could search for whatever funny characters will make it through your analyzer.
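A rough sketch of the tokenizing step, in plain Java. SimpleTokens and tokenize are hypothetical names, and splitting on whitespace plus lowercasing is only an assumed stand-in for running your real analyzer. The Lucene half is left out on purpose because the construction details vary by version: each returned token would be wrapped in a TermQuery and added to a BooleanQuery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch only: a naive replacement for the analyzer, not Lucene code. */
final class SimpleTokens {

    /** Splits on runs of whitespace and lowercases each piece. */
    static List<String> tokenize(String raw) {
        List<String> tokens = new ArrayList<>();
        for (String piece : raw.trim().split("\\s+")) {
            // An empty or all-whitespace input yields one empty piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

Because none of the user's text ever passes through the QueryParser, characters like ! or ( are just part of a token and cannot cause a ParseException. An input that produces no tokens at all is the one case you would still need to handle yourself.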

Jay Kominek answered Sep 22 '22 18:09


FYI... Here is the code I am using for .NET

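// Requires: using System.Text.RegularExpressions;
private Query GetSafeQuery(QueryParser qp, String query)
{
    try
    {
        // Try the query exactly as the user typed it.
        return qp.Parse(query);
    }
    catch (Lucene.Net.QueryParsers.ParseException)
    {
        // Fall back: replace everything except word characters, '.', '@' and '-' with a space.
        string cooked = Regex.Replace(query, @"[^\w\.@-]", " ");

        // This second parse can still throw, so the caller must handle that.
        return qp.Parse(cooked);
    }
}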

josefresno answered Sep 24 '22 18:09