
PyParsing OR statement

This is going to end up being really simple, but I'm trying to match one of the two patterns:

"GET /ligonier-broadcast-media/mp3/rym20110421.mp3 HTTP/1.1"

or

-

I've tried something like this:

key = Word(alphas + nums + "/" + "-" + "_" + "." + "?" + "=" + "%" + "&")

uri = Or("-" | Group(
                   Suppress("\"") +
                   http_method +
                   key.setResultsName("request_uri") +
                   http_protocol +
                   Suppress("\"")
               )
      )

But it doesn't seem to match. I'm not at all sure how to use Or(), whether I should be using Group(), or what. I know the expression inside Group() works when used on its own, but I need to accept either the dash or the quoted URI string, not just one of them.

Log format can't be negotiated, we're consuming what we've been given. Any tips would be greatly appreciated.

Greg asked Apr 27 '11 03:04


2 Answers

In general, the Or, And, MatchFirst, and Each classes are very rarely used overtly in pyparsing. The recommended style is to use their analogous operator overloads. In your case, you are using both forms, and it is just getting in your way.

Here is your expression, after a little cleaning up:

key = Word(alphanums + "/-_.?=%&")
QUOT = Suppress('"')
uri = ("-" | QUOT
             + http_method
             + key("request_uri")
             + http_protocol
             + QUOT
      )

The arguments to Word are character strings representing sets of allowed characters. If just one argument is used (as in your case), then the string is interpreted as simply the set of characters that can be parsed as part of the Word. If 2 strings are given, then the first represents the set of acceptable initial characters, and the second is the set of acceptable body characters. This is useful when defining something like a variable name, which in Python for instance allows only alphas and '_' for the initial character, but also allows numeric digits in the body; this would be Word(alphas+'_', alphanums+'_'). Since the arguments to Word are just strings, there is no need to separately add "/" + "-" + "_" + ..., just combine them into a single string.
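
To make the one-argument and two-argument forms concrete, here is a small sketch. The name identifier and the sample strings are made up for illustration; they are not part of your parser.

```python
from pyparsing import Word, alphas, alphanums, ParseException

# One argument: every character of the word must come from this set.
key = Word(alphanums + "/-_.?=%&")

# Two arguments: the first string is the set of allowed initial characters,
# the second is the set of allowed body characters.
identifier = Word(alphas + "_", alphanums + "_")  # hypothetical name

key_result = key.parseString("/mp3/rym20110421.mp3")[0]
ident_result = identifier.parseString("_count2")[0]

# A leading digit is not in the initial-character set, so this one fails.
try:
    identifier.parseString("2count")
    leading_digit_ok = True
except ParseException:
    leading_digit_ok = False
```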

The '|' operator delimits allowed alternatives, generating a MatchFirst expression. It is called MatchFirst because the parser stops at the first alternative that matches. So when parsing the string "abc" with Word(alphas) | Word(nums), pyparsing won't even try the Word(nums) expression, because the first alternative already matches. This gets trickier if the alternatives overlap. Let's say you want to match words of letters, words of digits, or words of letters and digits, and you want to parse the string "abc123". This parser:

Word(alphas) | Word(nums) | Word(alphanums)

will parse only the opening 'abc' of the string with the leading Word(alphas), and never get to Word(alphanums). We can often resolve such an issue by rearranging the alternatives, such as:

Word(alphanums) | Word(alphas) | Word(nums)

but not all cases are so easily refactored. So pyparsing also supports the Or expression, defined using the '^' operator (which I chose because the '^' reminds me of a pair of draftsman's dividers, for measuring length). An Or expression tries to apply all the given alternatives, and selects the longest matching one. So you could write my little test example as:

Word(alphas) ^ Word(nums) ^ Word(alphanums)

and now pyparsing will not stop when matching "abc", but will try all of the alternatives, and eventually select the third alternative, matching "abc123", because it gives a longer match.
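
The three variants can be compared side by side with a short sketch. The variable names are only for illustration, and parseString is called without requiring the whole input to be consumed, so a partial match shows up as a shorter result rather than an error.

```python
from pyparsing import Word, alphas, nums, alphanums

text = "abc123"

# MatchFirst ('|'): the first alternative that matches wins.
first_match = Word(alphas) | Word(nums) | Word(alphanums)

# MatchFirst again, reordered so the most general alternative is tried first.
reordered = Word(alphanums) | Word(alphas) | Word(nums)

# Or ('^'): all alternatives are tried and the longest match wins.
longest = Word(alphas) ^ Word(nums) ^ Word(alphanums)

first_result = first_match.parseString(text).asList()
reordered_result = reordered.parseString(text).asList()
longest_result = longest.parseString(text).asList()

print(first_result)      # ['abc']
print(reordered_result)  # ['abc123']
print(longest_result)    # ['abc123']
```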

For your URI definition, there is no need to do Or matching. There is no way the parser will confuse a leading '-' with a quoted HTTP command string. So using MatchFirst, which you have done by using the '|' operator, is perfectly adequate.
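
Here is how that might look as a runnable sketch. Your question does not show http_method or http_protocol, so the two definitions below are assumptions made up purely so the example runs; substitute your own.

```python
from pyparsing import Word, Suppress, Combine, oneOf, alphanums, nums

# Assumed definitions (not shown in the question), for illustration only.
http_method = oneOf("GET POST PUT DELETE HEAD OPTIONS")
http_protocol = Combine("HTTP/" + Word(nums + "."))

key = Word(alphanums + "/-_.?=%&")
QUOT = Suppress('"')

# '+' binds tighter than '|', so this reads as: "-" | (QUOT + ... + QUOT)
uri = "-" | QUOT + http_method + key("request_uri") + http_protocol + QUOT

line = '"GET /ligonier-broadcast-media/mp3/rym20110421.mp3 HTTP/1.1"'
request = uri.parseString(line)
dash = uri.parseString("-")

print(request.asList())
print(request["request_uri"])
print(dash.asList())
```

With no Group in the expression, the tokens come back as a flat list, and request_uri is reachable directly by name on the result.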

Some other items:

  • Don't write "\"" in Python if you can help it. Python supports both quoting characters for just this reason. Use '"' instead. Backslashes are for C programmers and Windows file names.

  • expr.setResultsName("name") has been simplified to expr("name") since pyparsing 1.4.6. The shortened syntax really helps the readability of your parser definitions.

  • Use Group only when you want to keep some structure in your results, or if you have a repeated structure that has some internal expression with a results name. Not really necessary for your parser, and just adds another list container wrapper on the results, requiring an extra [0] index to get to your parsed data.
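
To illustrate the last two points, here is a small sketch of what results names and Group do to the parsed output. The pair expression and the sample input are hypothetical, chosen only to keep the example short.

```python
from pyparsing import Word, Group, OneOrMore, alphas, nums

# expr("name") is the short form of expr.setResultsName("name").
pair = Word(alphas)("name") + Word(nums)("value")

# Without Group: a flat token list, names available on the top-level result.
flat = pair.parseString("abc 123")

# With Group: the same tokens wrapped in one more list level,
# so the names are reached through an extra [0].
grouped = Group(pair).parseString("abc 123")

# Where Group earns its keep: a repeated structure with named parts.
rows = OneOrMore(Group(pair)).parseString("a 1 b 2")

print(flat.asList())       # ['abc', '123']
print(grouped.asList())    # [['abc', '123']]
print(rows.asList())       # [['a', '1'], ['b', '2']]
print(rows[1]["name"])     # b
```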

(If you do decide you want to explicitly call Or, And, etc., be sure to pass a list of the expressions, and don't just list them as arguments to the expression constructor. See the question "Why is ordered choice in pyparsing failing for my use case?" for how such a typo can mess things up, which is why I encourage using the arithmetic operators to compose your parsers.)
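
If you do go the explicit route, a minimal sketch of the list form looks like this. It reuses the small test alternatives from above rather than your URI expression, and the variable names are only illustrative.

```python
from pyparsing import Or, MatchFirst, Word, alphas, nums, alphanums

# The alternatives go in ONE list argument, not as separate arguments.
alternatives = [Word(alphas), Word(nums), Word(alphanums)]

explicit_or = Or(alternatives)             # same as a ^ b ^ c
explicit_first = MatchFirst(alternatives)  # same as a | b | c

or_result = explicit_or.parseString("abc123").asList()
first_result = explicit_first.parseString("abc123").asList()

print(or_result)     # ['abc123']
print(first_result)  # ['abc']
```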

PaulMcG answered Oct 01 '22 21:10


I think you want...

from pyparsing import oneOf
# more code here
# note: oneOf accepts only literal strings; a compound expression has to be joined with '|' instead
uri = oneOf(["-", <insert long match expr here>])
uri.parseString(someStringVar)

Mike Pennington answered Oct 01 '22 22:10