
Regex pattern to parse HttpLog format

I am looking for a regex pattern that matches a log line in HttpLogFormat. The log is generated by haproxy. Below is a sample line in this format.

Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"

An explanation of the format is available at HttpLogFormat. Any help is appreciated.

I am trying to get the individual pieces of information included in that line. Here are the fields:

  1. process_name '[' pid ']:'
  2. client_ip ':' client_port
  3. '[' accept_date ']'
  4. frontend_name
  5. backend_name '/' server_name
  6. Tq '/' Tw '/' Tc '/' Tr '/' Tt*
  7. status_code
  8. bytes_read
  9. captured_request_cookie
  10. captured_response_cookie
  11. termination_state
  12. actconn '/' feconn '/' beconn '/' srv_conn '/' retries
  13. srv_queue '/' backend_queue
  14. '{' captured_request_headers* '}'
  15. '{' captured_response_headers* '}'
  16. '"' http_request '"'

Thimmayya asked Oct 29 '10 19:10


3 Answers

Regex:

^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) (\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) \{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$

Results:

Group 1:    Feb 6 12:14:14
Group 2:    localhost
Group 3:    haproxy
Group 4:    14389
Group 5:    10.0.1.2
Group 6:    33317
Group 7:    06/Feb/2009:12:14:14.655
Group 8:    http-in
Group 9:    static
Group 10:   srv1
Group 11:   10/0/30/69/109
Group 12:   200
Group 13:   2750
Group 14:   -
Group 15:   -
Group 16:   ----
Group 17:   1/1/1/1/0
Group 18:   0/0
Group 19:   1wt.eu
Group 20:   (empty)
Group 21:   GET
Group 22:   /index.html
Group 23:   HTTP/1.1

I use RegexBuddy for composing complex regular expressions.
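
For anyone who wants to try the pattern above from code, here is a minimal Python sketch. The regex is the one from this answer, unchanged; the field names in FIELDS and the helper parse_line are my own labels for the 23 groups, not anything defined by haproxy. It assumes the syslog prefix looks like the sample (month, day, time, host separated by single spaces).

```python
import re

# The pattern from this answer, unchanged, split across lines for readability.
HAPROXY_HTTP_LOG = re.compile(
    r'^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) '
    r'(\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) '
    r'\{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$'
)

# Hypothetical labels for groups 1..23, in order (assumption, not haproxy names).
FIELDS = (
    "syslog_time", "host", "process_name", "pid",
    "client_ip", "client_port", "accept_date", "frontend_name",
    "backend_name", "server_name", "timers", "status_code", "bytes_read",
    "captured_request_cookie", "captured_response_cookie",
    "termination_state", "conn_counts", "queues",
    "captured_request_headers", "captured_response_headers",
    "method", "path", "http_version",
)


def parse_line(line):
    """Return a dict of field name -> captured text, or None if no match."""
    m = HAPROXY_HTTP_LOG.match(line.rstrip("\n"))
    if m is None:
        return None
    return dict(zip(FIELDS, m.groups()))


SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')

record = parse_line(SAMPLE)
# Slash-separated groups can be split further once the line has matched.
tq, tw, tc, tr, tt = record["timers"].split("/")
print(record["client_ip"], record["status_code"], tt)  # prints: 10.0.1.2 200 109
```

Every captured value is a string; convert to int where you need numbers, and treat a None result as "line not in this format" rather than an error.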

Mike Clark answered Oct 17 '22 19:10


Use at your own peril.

This assumes that all fields return something except for the ones you have marked with asterisks (is that what the asterisk means?). There are also obvious failure cases, such as nested brackets of any kind, but if the logger prints reasonably sane messages, then I guess you'd be okay.

Of course, even I personally wouldn't want to have to maintain this, but there you have it. You might want to consider writing a regular ol' parser for this instead, if you can.
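
To give an idea of what a hand-written parser might look like, here is a rough Python sketch. It splits the line into tokens, treating anything inside [...], {...} or "..." as a single token, and then picks fields by position. The names tokenize and parse_http_log, the dict keys, and the assumption that the syslog prefix is exactly four tokens (month, day, time, host) are all mine; nested or unbalanced delimiters are not handled.

```python
# Delimiters that open a multi-word token, mapped to the character closing it.
CLOSERS = {"[": "]", "{": "}", '"': '"'}


def tokenize(line):
    """Split on whitespace, keeping [..], {..} and ".." contents together."""
    tokens, i, n = [], 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch in CLOSERS:
            end = line.find(CLOSERS[ch], i + 1)
            if end == -1:
                raise ValueError(f"unterminated {ch!r} at column {i}")
            tokens.append(line[i + 1:end])  # content without the delimiters
            i = end + 1
        else:
            j = i
            while j < n and not line[j].isspace():
                j += 1
            tokens.append(line[i:j])
            i = j
    return tokens


def parse_http_log(line):
    t = tokenize(line)
    # Assumption: 4 syslog tokens + 16 haproxy tokens, as in the sample line.
    if len(t) != 20:
        raise ValueError(f"expected 20 tokens, got {len(t)}")
    process_name, _, rest = t[4].partition("[")
    client_ip, _, client_port = t[5].rpartition(":")
    backend_name, _, server_name = t[8].partition("/")
    return {
        "process_name": process_name,
        "pid": rest.rstrip("]:"),
        "client_ip": client_ip,
        "client_port": client_port,
        "accept_date": t[6],
        "frontend_name": t[7],
        "backend_name": backend_name,
        "server_name": server_name,
        "timers": t[9].split("/"),        # Tq, Tw, Tc, Tr, Tt
        "status_code": t[10],
        "bytes_read": t[11],
        "captured_request_cookie": t[12],
        "captured_response_cookie": t[13],
        "termination_state": t[14],
        "conn_counts": t[15].split("/"),  # actconn .. retries
        "queues": t[16].split("/"),       # srv_queue, backend_queue
        "captured_request_headers": t[17],
        "captured_response_headers": t[18],
        "http_request": t[19],
    }


SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')
rec = parse_http_log(SAMPLE)
print(rec["client_ip"], rec["timers"], rec["http_request"])
```

The upside over one big regex is that a malformed line fails with a message about token count or an unterminated delimiter instead of a silent non-match.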

Edit: Marked this as CW (community wiki) since it's more of an "I wonder how this will turn out" kind of answer than anything else. For quick reference, this is what I ended up constructing in rubular:

^[^[]+\s+(\w+)\[(\d+)\]:\s*([^:]+):(\d+)\s+\[([^\]]+)\]\s+([^\s]+)\s+(\w+)\/(\w+)\s+(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d*)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)\s+(\d+)\/(\d+)\s+\{([^}]*)\}\s\{([^}]*)\}\s+\"([^"]+)\"$

My first programming language was Perl, and even I'm willing to admit that I'm frightened by that.

2 revs answered Oct 17 '22 19:10


That looks like a very complicated string to match on. I would recommend using a tool like Expresso. Start with the string you are trying to match, then replace pieces of it with regex notation one at a time.

To grab individual pieces, use grouping parentheses.

The other option would be to make a regex for each piece you are trying to grab.
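
A rough Python sketch of that "one regex per piece" option, in case it helps. Each of the 16 fields from the question gets its own small pattern with named groups, and the pieces are matched one after another so a failure can name the piece that broke. The labels, the helper parse_by_pieces, and the optional sign allowed on numeric fields are my own assumptions, not something taken from the format description.

```python
import re

# One small pattern per field from the question, in order (labels are mine).
PIECES = [
    ("process", r"(?P<process_name>\w+)\[(?P<pid>\d+)\]:"),
    ("client", r"(?P<client_ip>\S+):(?P<client_port>\d+)"),
    ("accept_date", r"\[(?P<accept_date>[^\]]+)\]"),
    ("frontend_name", r"(?P<frontend_name>\S+)"),
    ("backend", r"(?P<backend_name>[^/\s]+)/(?P<server_name>\S+)"),
    ("timers", r"(?P<Tq>[-+]?\d+)/(?P<Tw>[-+]?\d+)/(?P<Tc>[-+]?\d+)"
               r"/(?P<Tr>[-+]?\d+)/(?P<Tt>[-+]?\d+)"),
    ("status_code", r"(?P<status_code>[-+]?\d+)"),
    ("bytes_read", r"(?P<bytes_read>[-+]?\d+)"),
    ("request_cookie", r"(?P<captured_request_cookie>\S+)"),
    ("response_cookie", r"(?P<captured_response_cookie>\S+)"),
    ("termination_state", r"(?P<termination_state>\S+)"),
    ("connections", r"(?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)"
                    r"/(?P<srv_conn>\d+)/(?P<retries>[-+]?\d+)"),
    ("queues", r"(?P<srv_queue>\d+)/(?P<backend_queue>\d+)"),
    ("request_headers", r"\{(?P<captured_request_headers>[^}]*)\}"),
    ("response_headers", r"\{(?P<captured_response_headers>[^}]*)\}"),
    ("http_request", r'"(?P<http_request>[^"]*)"'),
]
PIECES = [(label, re.compile(pattern)) for label, pattern in PIECES]
WHITESPACE = re.compile(r"\s*")


def parse_by_pieces(line):
    """Match the pieces left to right; raise ValueError naming a failed piece."""
    fields, pos = {}, 0
    for index, (label, pattern) in enumerate(PIECES):
        if index == 0:
            # Search for the first piece so the syslog prefix is skipped.
            m = pattern.search(line)
        else:
            # Later pieces must start right after the previous one.
            pos = WHITESPACE.match(line, pos).end()
            m = pattern.match(line, pos)
        if m is None:
            raise ValueError(f"could not match piece {label!r} at column {pos}")
        fields.update(m.groupdict())
        pos = m.end()
    return fields


SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')
rec = parse_by_pieces(SAMPLE)
print(rec["frontend_name"], rec["Tt"], rec["http_request"])
```

Each piece can be tested and adjusted on its own, which is the main reason to prefer this over a single long pattern. Note that this sketch does not check for trailing text after the final quoted request.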

Seattle Leonard answered Oct 17 '22 19:10