
Parse Apache log in PHP using preg_match

I need to save log data in a table (for reporting, stats, etc.) so a user can search by time, user agent, and so on. I have a script that runs every day, reads the Apache log, and inserts it into the database.

Log format:

10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"

My regex:

preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) (\".*?\") (\".*?\")$/',$log, $matches);

Now when I print:

print_r($matches);

Array
(
    [0] => 10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
    [1] => 10.1.1.150
    [2] => -
    [3] => -
    [4] => 29/September/2011
    [5] => 14:21:49
    [6] => -0400
    [7] => GET
    [8] => /info/
    [9] => HTTP/1.1
    [10] => 200
    [11] => 9955
    [12] => "http://www.domain.com/download/"
    [13] => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
)

The referrer comes back as "http://www.domain.com/download/", quotes included, and the same happens with the user agent. How can I get rid of these " in the regex? Bonus: is there a quick way to insert the date/time into the database?

Thanks

Tech4Wilco asked Sep 29 '11 20:09


4 Answers

To parse an Apache access_log in PHP you can use this regex:

$regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';
preg_match($regex ,$log, $matches);
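As a rough sketch of how this might be used, including the bonus about the date: the snippet below runs the pattern on the sample line from the question and turns the three date/time captures into a `Y-m-d H:i:s` string that could go into a MySQL DATETIME column. The `$log` value, the variable names, and the choice to normalise to UTC are assumptions for illustration, not part of the original answer.

```php
<?php
// Sample line from the question (assumed to be one line of the access log).
$log = '10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 '
     . '"http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) '
     . 'AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"';

$regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';

$mysqlDate = null;
if (preg_match($regex, $log, $matches)) {
    // Groups 4, 5 and 6 are the date, the time and the UTC offset.
    $stamp = $matches[4] . ' ' . $matches[5] . ' ' . $matches[6];

    // When parsing, "M" accepts short (Sep) and full (September) month names.
    $date = DateTime::createFromFormat('d/M/Y H:i:s O', $stamp);
    if ($date !== false) {
        // Assumption: timestamps are stored in UTC.
        $date->setTimezone(new DateTimeZone('UTC'));
        $mysqlDate = $date->format('Y-m-d H:i:s');
    }

    echo $mysqlDate, PHP_EOL;    // value for the DATETIME column
    echo $matches[12], PHP_EOL;  // referrer, without the surrounding quotes
}
```

If the log's local time should be stored instead, dropping the `setTimezone()` call keeps the wall-clock time from the log line.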

To match the Apache error_log format, you can use this regex:

$regex = '/^\[([^\]]+)\] \[([^\]]+)\] (?:\[client ([^\]]+)\])?\s*(.*)$/i';
preg_match($regex, $log, $matches);
// $matches[1] = date and time,            $matches[2] = severity,
// $matches[3] = client addr (if present), $matches[4] = log message

It matches lines with or without the client:

[Tue Feb 28 11:42:31 2012] [notice] Apache/2.4.1 (Unix) mod_ssl/2.4.1 OpenSSL/0.9.8k PHP/5.3.10 configured -- resuming normal operations
[Tue Feb 28 14:34:41 2012] [error] [client 192.168.50.10] Symbolic link not allowed or link target not accessible: /usr/local/apache2/htdocs/x.js
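A minimal sketch of the error_log pattern applied to the two sample lines, collecting each into an associative array. The array keys and the `$entries` variable are made up for illustration; mapping a missing client to `null` is also just one possible choice.

```php
<?php
$regex = '/^\[([^\]]+)\] \[([^\]]+)\] (?:\[client ([^\]]+)\])?\s*(.*)$/i';

$lines = [
    '[Tue Feb 28 11:42:31 2012] [notice] Apache/2.4.1 (Unix) mod_ssl/2.4.1 OpenSSL/0.9.8k PHP/5.3.10 configured -- resuming normal operations',
    '[Tue Feb 28 14:34:41 2012] [error] [client 192.168.50.10] Symbolic link not allowed or link target not accessible: /usr/local/apache2/htdocs/x.js',
];

$entries = [];
foreach ($lines as $line) {
    if (!preg_match($regex, $line, $m)) {
        continue; // skip lines that do not look like error_log entries
    }
    $entries[] = [
        'time'     => $m[1],
        'severity' => $m[2],
        // The optional group comes back as an empty string when absent.
        'client'   => $m[3] !== '' ? $m[3] : null,
        'message'  => $m[4],
    ];
}

print_r($entries);
```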
drew010 answered Oct 23 '22 07:10


If you don't want to capture the double quotes, move them out of the capture groups.

 (\".*?\") 

Should become:

 \"(.*?)\"

As an alternative, you could just post-process the entries with trim($str, '"').
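To illustrate, here is a small sketch that tries both options on the sample line from the question and checks that they give the same referrer. The variable names (`$fixed`, `$original`, `$raw`, and so on) are only for illustration.

```php
<?php
$log = '10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 '
     . '"http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) '
     . 'AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"';

// Option 1: the pattern from the question with the quotes moved outside the last two groups.
$fixed = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) \"(.*?)\" \"(.*?)\"$/';
preg_match($fixed, $log, $matches);
$referrer = $matches[12];

// Option 2: keep the original pattern and strip the quotes afterwards.
$original = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) (\".*?\") (\".*?\")$/';
preg_match($original, $log, $raw);
$trimmed = trim($raw[12], '"');

var_dump($referrer === $trimmed); // both approaches yield the same string
```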

mario answered Oct 23 '22 08:10


Your regex is wrong. You should use this one instead:

/^(\S+) (\S+) (\S+) - \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/
qwertmax answered Oct 23 '22 07:10


Having seen and written plenty of erroneous log parsers, here is a hopefully valid regex, tested on 50k lines of logs without a single diff, given that:

  • auth_user can contain spaces
  • response_size can be -
  • http_start_line contains at least one space (HTTP/0.9), or two
  • http_start_line may contain double quotes
  • referrer can be empty, or contain spaces or double quotes (it's just an HTTP header)
  • user_agent can be empty too, or contain double quotes and spaces
  • It's hard to distinguish between referrer and user-agent. Let's just hope the " " between the two is discriminating enough, yet the infamous " " can also appear inside the referrer or the user-agent, so basically, we're screwed here.

    $ncsa_re = '/^(?P<IP>\S+)
    \ (?P<ident>\S+)
    \ (?P<auth_user>.*?) # Spaces are allowed here, can be empty.
    \ (?P<date>\[[^]]+\])
    \ "(?P<http_start_line>.+\ .+)" # At least one space: HTTP 0.9
    \ (?P<status_code>[0-9]+) # Status code is _always_ an integer
    \ (?P<response_size>(?:[0-9]+|-)) # Response size can be -
    \ "(?P<referrer>.*)" # Referrer can contain anything: it is just a header
    \ "(?P<user_agent>.*)"$/x';
    

Hope that helps.
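A minimal sketch of how the named groups might be consumed, using a pattern of the same shape on the sample line from the question. Here `ident` is `\S+` and the space inside the request line is escaped so that it survives the `x` flag; the `$fields` variable and the filtering step are assumptions for illustration.

```php
<?php
$log = '10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 '
     . '"http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) '
     . 'AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"';

// Extended (x) mode: unescaped whitespace in the pattern is ignored,
// so every literal space is written as "\ ".
$ncsa_re = '/^(?P<IP>\S+)
\ (?P<ident>\S+)
\ (?P<auth_user>.*?)
\ (?P<date>\[[^]]+\])
\ "(?P<http_start_line>.+\ .+)"
\ (?P<status_code>[0-9]+)
\ (?P<response_size>(?:[0-9]+|-))
\ "(?P<referrer>.*)"
\ "(?P<user_agent>.*)"$/x';

$fields = [];
if (preg_match($ncsa_re, $log, $m)) {
    // preg_match returns each named group twice (by name and by index);
    // keep only the string keys.
    $fields = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

print_r($fields);
```

Because the quotes sit outside the `referrer` and `user_agent` groups, both values come back without them.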

Julien Palard answered Oct 23 '22 07:10