I am looking for a regex pattern matcher for a String in HttpLogFormat. The log is generated by haproxy. Below is a sample String in this format.
Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"
An explanation of the format is available at HttpLogFormat. Any help is appreciated.
I am trying to get the individual pieces of information included in that line. Here are the fields:
Using special characters: for example, to match a single "a" followed by zero or more "b"s followed by "c", you'd use the pattern /ab*c/: the * after "b" means "0 or more occurrences of the preceding item."
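As a quick check of that rule, here is a minimal sketch in Python (the language is my choice, the thread does not name one). The candidate strings are made up for illustration.

```python
import re

# "a", then zero or more "b", then "c".
pattern = re.compile(r"ab*c")

for candidate in ["ac", "abc", "abbbc", "adc"]:
    # fullmatch requires the whole string to fit the pattern.
    print(candidate, bool(pattern.fullmatch(candidate)))
```

Only "adc" should fail here, because "d" is neither a "b" nor the closing "c".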
The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.
Most importantly, regexes should not be used for parsing strings. You should instead use or write a bespoke parser. For example, you can't parse HTML with regex (in Python, use BeautifulSoup; in JavaScript, use the DOM).
Regex:
^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) (\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) \{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$
Results:
Group 1: Feb 6 12:14:14
Group 2: localhost
Group 3: haproxy
Group 4: 14389
Group 5: 10.0.1.2
Group 6: 33317
Group 7: 06/Feb/2009:12:14:14.655
Group 8: http-in
Group 9: static
Group 10: srv1
Group 11: 10/0/30/69/109
Group 12: 200
Group 13: 2750
Group 14: -
Group 15: -
Group 16: ----
Group 17: 1/1/1/1/0
Group 18: 0/0
Group 19: 1wt.eu
Group 20:
Group 21: GET
Group 22: /index.html
Group 23: HTTP/1.1
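For anyone who wants to try this outside a regex tool, here is a rough Python sketch that applies the same regex to the sample line and labels the 23 groups. The field names in FIELDS and the helper parse_line are my own labels, not official HAProxy names, so check them against the format documentation before relying on them.

```python
import re

# Sample line copied from the question.
LINE = (
    'Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
    '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)

# The regex from above, unchanged, only split across lines for readability.
HAPROXY_RE = re.compile(
    r'^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) '
    r'(\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) '
    r'\{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$'
)

# Hypothetical labels, one per capture group, in group order.
FIELDS = [
    "syslog_time", "host", "process", "pid", "client_ip", "client_port",
    "accept_date", "frontend", "backend", "server", "timers", "status",
    "bytes_read", "request_cookie", "response_cookie", "termination_state",
    "connections", "queues", "request_headers", "response_headers",
    "method", "path", "protocol",
]

def parse_line(line):
    """Return a dict of labelled fields, or None if the line does not match."""
    match = HAPROXY_RE.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))

record = parse_line(LINE)
print(record["client_ip"], record["status"], record["path"])
```

Every value comes back as a string. Splitting fields such as timers on "/" and converting to integers is left to the caller.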
I use RegexBuddy for composing complex regular expressions.
Use at your own peril.
This assumes that every field returns something except the ones you have marked with asterisks (is that what the asterisk means?). There are also obvious failure cases, such as nested brackets of any kind, but if the logger prints reasonably sane messages, then I guess you'd be okay...
Of course, even I personally wouldn't want to have to maintain this, but there you have it. You might want to consider writing a regular ol' parser for this instead, if you can.
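To give an idea of what a plain parser could look like, here is a minimal Python sketch that uses only string splitting. It assumes the layout of the sample line in the question, and in particular that the captured header blocks in braces contain no spaces. The function name parse_haproxy_line and the dictionary keys are made up for illustration.

```python
# Sample line copied from the question.
SAMPLE = (
    'Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
    '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)

def parse_haproxy_line(line):
    """Split one log line into a dict (a sketch, not an exhaustive parser)."""
    # The syslog prefix ends right after the pid: "haproxy[14389]: ".
    prefix, sep, rest = line.partition("]: ")
    if not sep:
        raise ValueError("syslog prefix not found")
    head, _, pid = prefix.rpartition("[")
    *timestamp, host, process = head.split()

    # The quoted request is the last field on the line.
    body, sep, request = rest.rstrip().rpartition(' "')
    if not sep or not request.endswith('"'):
        raise ValueError("quoted request not found")
    method, _, remainder = request[:-1].partition(" ")
    path, _, protocol = remainder.rpartition(" ")

    # Assumption: no remaining field contains whitespace.
    tokens = body.split()
    if len(tokens) != 14:
        raise ValueError(f"expected 14 fields, got {len(tokens)}")
    client_ip, client_port = tokens[0].rsplit(":", 1)
    backend, server = tokens[3].split("/", 1)

    return {
        "syslog_time": " ".join(timestamp), "host": host,
        "process": process, "pid": int(pid),
        "client_ip": client_ip, "client_port": int(client_port),
        "accept_date": tokens[1].strip("[]"), "frontend": tokens[2],
        "backend": backend, "server": server,
        "timers": [int(t) for t in tokens[4].split("/")],
        "status": int(tokens[5]), "bytes_read": int(tokens[6]),
        "method": method, "path": path, "protocol": protocol,
    }

record = parse_haproxy_line(SAMPLE)
print(record["timers"], record["status"], record["path"])
```

The upside over one giant regex is that each step fails with its own error message, which makes a malformed line much easier to diagnose.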
Edit: Marked this as CW since it's more of an "I wonder how this will turn out" kind of answer than anything else. For quick reference, this is what I ended up constructing in rubular:
^[^[]+\s+(\w+)\[(\d+)\]:([^:]+):(\d+)\s+\[([^\]]+)\]\s+[^\s]+\s+(\w+)\/(\w+)\s+(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d*)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)\s+(\d+)\/(\d+)\s+\{([^}]*)\}\s\{([^}]*)\}\s+\"([^"]+)\"$
My first programming language was Perl, and even I'm willing to admit that I'm frightened by that.
That looks like a very complicated string to match on. I would recommend using a tool like Expresso. Start with the string you are trying to match, then replace pieces of it with regex notation one at a time.
To grab individual pieces, use grouping parentheses.
The other option would be to make a regex for each piece you are trying to grab.
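A rough Python sketch of that second option, with one small regex per piece. The piece names and patterns below are my own guesses based on the sample line in the question, so treat them as a starting point and not as a tested solution for every HAProxy configuration.

```python
import re

# Sample line copied from the question.
LINE = (
    'Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
    '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)

# One small pattern per piece; each is searched independently.
PIECES = {
    # Digits in brackets followed by a colon: the process id.
    "pid": re.compile(r"\[(\d+)\]:"),
    # Address and port right after the syslog prefix.
    "client": re.compile(r"\]: (\S+):(\d+) "),
    # The bracketed timestamp that has a space on both sides.
    "accept_date": re.compile(r" \[([^\]]+)\] "),
    # The number following "backend/server" and the slash-separated timers.
    "status": re.compile(r" \S+/\S+ [\d/+-]+ (\d+) "),
    # The quoted request at the end of the line.
    "request": re.compile(r'"([^"]*)"\s*$'),
}

def extract(line):
    """Return a dict mapping each piece name to its captured groups (or None)."""
    result = {}
    for name, regex in PIECES.items():
        match = regex.search(line)
        result[name] = match.groups() if match else None
    return result

pieces = extract(LINE)
print(pieces["client"], pieces["status"], pieces["request"])
```

Each pattern stays short enough to read, and a piece that fails to match shows up as None instead of making the whole line fail.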