
How to parse logs (nginx/apache access.log) with a mix of delimiters, i.e. square brackets, spaces and double quotes, and optionally convert to JSON?

I have an nginx access.log. It is delimited by 1) whitespace, 2) [ ] and 3) double quotes.

::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"

After parsing, it is supposed to look like this:

$1 = ::1

$4 = [12/Oct/2021:15:26:25 +0530] or 12/Oct/2021:15:26:25 +0530

$5 = "GET / HTTP/1.1"

$6 = 200

$7 = 1717

$8 = "-"

$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"

I tried some options like awk -F'[],] *' and awk -f [][{}], but they don't work on the full line.

The nginx access.log shared here is just an example. I am trying to understand how to parse input with such a mix of delimiters, so that I can apply it to other complex logs.

asked Oct 17 '21 04:10 by GLK




3 Answers

If you can use GNU awk (gawk), you can use FPAT to describe what the content of a field looks like, rather than what separates the fields:

awk -v FPAT='\\[[^][]*]|"[^"]*"|\\S+' '{
  for(i=1; i<=NF; i++) {
    print "$"i" = ", $i
  }
}' file

The pattern matches:

  • \\[[^][]*] Match from an opening [ till closing ] using a negated character class
  • | Or
  • "[^"]*" Match from an opening till closing double quote
  • | Or
  • \\S+ 1 or more non whitespace chars

Output

$1 =  ::1
$2 =  -
$3 =  -
$4 =  [12/Oct/2021:15:26:25 +0530]
$5 =  "GET / HTTP/1.1"
$6 =  200
$7 =  1717
$8 =  "-"
$9 =  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
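
For the "optionally convert to JSON" part of the question, the same FPAT can feed a small formatter. The following is only a sketch under a few assumptions: it needs GNU awk, the function name log_to_json is made up for this example, the key names are borrowed from the variables of nginx's combined log format (with "ident" as a placeholder for the second field), every value is emitted as a JSON string, and lines that do not split into exactly 9 fields are skipped.

```shell
# Sketch: one JSON object per access-log line. Requires GNU awk (FPAT).
# log_to_json and the key names are illustrative assumptions.
log_to_json() {
  gawk -v FPAT='\\[[^][]*]|"[^"]*"|\\S+' '
  # Remove one leading and one trailing delimiter: [..] or "..".
  function strip(s) {
    if (s ~ /^\[.*\]$/ || s ~ /^".*"$/) return substr(s, 2, length(s) - 2)
    return s
  }
  # Put a backslash in front of every " and \ so the value fits in a JSON string.
  function esc(s,    i, c, out) {
    out = ""
    for (i = 1; i <= length(s); i++) {
      c = substr(s, i, 1)
      if (c == "\\" || c == "\"") out = out "\\"
      out = out c
    }
    return out
  }
  BEGIN {
    n = split("remote_addr ident remote_user time_local request status body_bytes_sent http_referer http_user_agent", key, " ")
  }
  NF == n {   # only lines that split into the expected number of fields
    line = "{"
    for (i = 1; i <= n; i++) {
      sep = (i > 1 ? "," : "")
      line = line sep "\"" key[i] "\":\"" esc(strip($i)) "\""
    }
    print line "}"
  }' "$@"
}
```

It can be called as log_to_json file, or used at the end of a pipe. Escaping is done character by character rather than with gsub, because the backslash rules for gsub replacement strings differ between awk implementations. Raw control characters inside a field are not handled here.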
answered Oct 07 '22 19:10 by The fourth bird


Since these are nginx logs, their format will stay the same (or there are settings with which you can keep it the same; I am talking about current versions). We can take advantage of this and concentrate on extracting only the parts we need, so I am using a regex to pick out the matching values and skip everything else. This way we do NOT need to hardcode the field numbers.

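This should work in any POSIX awk; the regex uses [[:space:]] instead of \s, which is a GNU awk extension.

awk '
{
  # Repeatedly take the leftmost match, print it, then cut the line
  # off right after that match and search the remainder again.
  # The first alternative only matches addresses such as ::1; adjust it
  # for other client addresses.
  while(match($0,/^::[0-9]+|\[?[0-9]{1,2}\/[a-zA-Z]{3}\/[0-9]{4}(:[0-9]{2}){3}[[:space:]]+\+[0-9]{4}\]?|"[^"]*"|[[:space:]][0-9]{3}[[:space:]]|[0-9]+[[:space:]]/)){
    val=substr($0,RSTART,RLENGTH)
    gsub(/^[[:space:]]+|[[:space:]]+$/,"",val)   # trim surrounding whitespace
    print val
    $0=substr($0,RSTART+RLENGTH)
  }
}'  Input_file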
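
The same match/substr loop can also be driven by a generic token pattern, which gives roughly what FPAT does without needing GNU awk. This is a sketch rather than a drop-in replacement: the function name split_fields is made up for this example, and it assumes that bracketed and quoted fields never contain their own closing delimiter.

```shell
# Sketch: FPAT-like splitting with a plain match() loop.
# A token is [...], "..." or a run of non-blank characters.
# split_fields is a hypothetical name used only for this example.
split_fields() {
  awk '{
    rest = $0
    n = 0
    while (match(rest, /\[[^]]*\]|"[^"]*"|[^ \t]+/)) {
      n++
      print "$" n " = " substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)   # continue after this token
    }
  }' "$@"
}
```

It can be called as split_fields Input_file. Because match() returns the longest match at the leftmost position, a field starting with [ or a double quote is taken whole instead of being cut at its first space. The character class [^ \t] is used instead of [^[:space:]] so that awk builds without POSIX class support can still run it.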
answered Oct 07 '22 19:10 by RavinderSingh13


This might work for you (GNU sed):

sed -E 'y/ /\n/
       :a;s/^(\[[^]\n]*)\n/\1 /m;s/^("[^"\n]*)\n/\1 /m;ta
       s/.*/echo '\''&'\'' | cat -n/e
       s/^  *(\S)\t/$\1 = /mg' file

Replace all spaces by newlines.

For every line that begins with [ or a double quote but has not been closed yet, replace the following newline with a space again, and repeat until the closing ] or double quote is on the same line.

Number all the lines.

Remove leading spaces and tabs and format the result.

answered Oct 07 '22 21:10 by potong