
Escaping search queries for Google's full text search service

This is a cross-post of https://groups.google.com/d/topic/google-appengine/97LY3Yfd_14/discussion

I'm working with the new full text search service in GAE 1.6.6 and I'm having trouble figuring out how to correctly escape my query strings before I pass them off to the search index. The docs mention that certain characters need to be escaped (namely the numeric operators), but they don't specify how the query parser expects the string to be escaped.

The issue I'm having is two-fold:

  1. Failing to escape the crap out of many characters (more than those that are hinted at in the docs) will cause the parser to raise a QueryException.
  2. When I've escaped the query to the point it won't raise, the numeric operators (>, <, >=, <=) no longer parse correctly (not factored into the search).

I set up a test where I feed each character of string.printable into my_index.search() and found that it would raise QueryException on each of the "printable" control characters, which I'm now stripping out, as well as on things that would seem innocent, like asterisks, commas, parentheses, braces, and tildes. None of these are mentioned in the docs as needing to be escaped.
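The probing test described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: `find_rejected_chars`, `fake_search`, and `FakeQueryError` are hypothetical names, and the fake parser only stands in for the real index so the snippet runs without App Engine.

```python
import string


def find_rejected_chars(search_fn, error_type, candidates=string.printable):
    """Return the characters for which search_fn raises error_type."""
    rejected = []
    for ch in candidates:
        try:
            search_fn(ch)
        except error_type:
            rejected.append(ch)
    return rejected


# Hypothetical stand-in for the real service. It is NOT the real query
# parser; it simply rejects a few of the characters named in the question.
class FakeQueryError(Exception):
    pass


def fake_search(query):
    if query in "*,(){}~":
        raise FakeQueryError(query)


print(find_rejected_chars(fake_search, FakeQueryError))
```

Against the real service, one would presumably pass my_index.search and the QueryException class in place of the two stand-ins.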

So far I've tried:

  • cgi.escape()
  • saxutils.escape() with a mapping of ASCII characters to their URL-encoded equivalents (e.g. , -> %2C)
  • saxutils.escape() with a mapping of ASCII characters to HTML numeric entities (e.g. { -> &#123;)
  • urllib.quote_plus()

I've gotten the best results so far using URL-style (%NN) replacements, but >, <, >=, and <= still fail to yield the expected results from the index. Also, and this doesn't really seem to have anything to do with the escaping issue, using NOT in front of a field = value type query doesn't seem to work as advertised either.
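For illustration, here is what two of the generic escapers listed above do to a numeric comparison. Both rewrite the operator characters themselves, which is one plausible reason the comparison stops being recognized (an assumption; the thread does not confirm how the parser treats the escaped text). The helper name `show_escaped` and the sample query are hypothetical.

```python
try:  # Python 2, as used in the question
    from urllib import quote_plus
    from cgi import escape as html_escape
except ImportError:  # Python 3 equivalents
    from urllib.parse import quote_plus
    from html import escape as html_escape


def show_escaped(query):
    """Return the query as rewritten by HTML escaping and by URL quoting."""
    return {"html": html_escape(query), "url": quote_plus(query)}


result = show_escaped("price >= 10")
print(result["html"])  # price &gt;= 10
print(result["url"])   # price+%3E%3D+10
```

In both outputs the literal `>=` is gone, so whatever the parser receives no longer contains the operator as written.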

tl;dr

How should I be escaping my queries before sending them to the search service so that the parser doesn't raise QueryException and my query yields expected results?

Owen Nelson asked May 24 '12 15:05


1 Answer

As briefly explained in the documentation, the query parameter is a string that should conform to our query language, which we should document better.

For now, I recommend wrapping your queries (or at least some of the words/terms) in double quotes. That way you can pass all printable characters except " and \. The following example shows the result.

import string
from google.appengine.api.search import Query
# Quote every printable character except '"' and '\', which are removed first.
Query('"%s"' % string.printable.replace('"', '').replace('\\', ''))

You can even pass non-printable characters:

Query('"%s"' % ''.join(chr(i) for i in xrange(128)).replace('"','').replace('\\', ''))

EDIT: Note that anything enclosed in double quotes is an exact match; that is, "foo bar" would match ...foo bar... but not ...bar foo...
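The quoting approach can be wrapped in a small helper so that user-supplied text is quoted while operators stay outside the quotes. This is only a sketch under assumptions: `quote_term` and `build_query` are hypothetical names, and joining a quoted phrase to an unquoted comparison with AND is assumed to be accepted by the query language rather than confirmed in this thread.

```python
def quote_term(text):
    """Wrap text in double quotes after dropping double quotes and backslashes."""
    cleaned = text.replace('"', '').replace('\\', '')
    return '"%s"' % cleaned


def build_query(user_text, filters=()):
    """Combine quoted free text with unquoted filter expressions.

    Assumption: the parser accepts 'phrase AND field >= value'.
    """
    parts = [quote_term(user_text)]
    parts.extend(filters)
    return ' AND '.join(parts)


print(quote_term('a*b,(c)~'))                      # "a*b,(c)~"
print(build_query('foo bar', ['price >= 10']))     # "foo bar" AND price >= 10
```

Because the quoted part is an exact match, this trades flexible matching on the user's text for a query that should not trip the parser.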

Sebastian Kreft answered Oct 07 '22 14:10