This is an nginx access.log. It is delimited by 1) whitespace, 2) [ ], and 3) double quotes.
::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
After parsing, it is supposed to look like:
$1 = ::1
$4 = [12/Oct/2021:15:26:25 +0530] or 12/Oct/2021:15:26:25 +0530
$5 = "GET / HTTP/1.1"
$6 = 200
$7 = 1717
$8 = "-"
$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
I tried some options like
awk -F'[],] *'
awk -F'[][{}]'
but they don't work on the full line.
The nginx access.log shared here is just an example. I am trying to understand how to parse with a mix of such delimiters, for use with other complex logs.
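For reference, lines like the ones above follow nginx's predefined combined log format (assuming the default access_log configuration), which is defined roughly as:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

So the bracketed field is the timestamp and the quoted fields are the request line, referrer and user agent.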
For your Apache access logs (or any other web server that uses the Common or Combined log format) you can use the Apache Access Log Parser. It has been fine-tuned to correctly handle access logs, so you should use this instead of the generic parser to save yourself some time.
The apache-access-log parser of syslog-ng creates a new name-value pair for each field of the log message, and does some additional parsing to get more information. When you have generic columnar logs (for example, a list of tab-separated or comma-separated values), you can parse those using the CSV parser in syslog-ng.
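As a rough sketch only (the source path, parser names, and options here are assumptions and may need adjusting for your syslog-ng version), wiring up either parser looks something like this:

# assumed paths and names; adjust to your setup
source s_nginx { file("/var/log/nginx/access.log"); };

# Combined/Common log format: creates name-value pairs under the given prefix
parser p_access { apache-access-log-parser(prefix("access.")); };

# generic columnar data: for example three tab-separated values
parser p_cols { csv-parser(columns("col1", "col2", "col3") delimiters("\t")); };

destination d_out { file("/var/log/nginx-parsed.log"); };

log { source(s_nginx); parser(p_access); destination(d_out); };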
If you can use gnu-awk, you can make use of FPAT to specify the column data:
awk -v FPAT='\\[[^][]*]|"[^"]*"|\\S+' '{
  for (i = 1; i <= NF; i++) {
    print "$"i" = ", $i
  }
}' file
The pattern matches:

\\[[^][]*]   match from an opening [ till a closing ], using a negated character class
| "[^"]*"    or: match from an opening till a closing double quote
| \\S+       or: 1 or more non-whitespace chars

Output:
$1 = ::1
$2 = -
$3 = -
$4 = [12/Oct/2021:15:26:25 +0530]
$5 = "GET / HTTP/1.1"
$6 = 200
$7 = 1717
$8 = "-"
$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
Since these are nginx logs, their format will be consistent (or there are settings by which you can keep the format the same, at least in current versions). We can take advantage of this and concentrate on extracting only the parts we need, so I am using a regex here to pick out the matching values and skip the rest. This way we do not need to hardcode field numbers; the regex does the trick. This should work in any awk version.
awk '
{
  while (match($0, /^::[0-9]+|\[?[0-9]{1,2}\/[a-zA-Z]{3}\/[0-9]{4}(:[0-9]{2}){3}[[:space:]]+\+[0-9]{4}\]?|"[^"]*"|[[:space:]][0-9]{3}[[:space:]]|[0-9]+[[:space:]]/)) {
    val = substr($0, RSTART, RLENGTH)
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)
    print val
    $0 = substr($0, RSTART + RLENGTH)
  }
}' Input_file
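A minimal sketch of the same match-and-consume idea, collecting the pieces into an array so they can be addressed by position (the field positions used in the print are assumptions based on the sample line):

awk '
{
  split("", f); n = 0
  # repeatedly take the leftmost [...] group, "..." group or bare token, then consume it
  while (match($0, /\[[^][]*\]|"[^"]*"|[^[:space:]]+/)) {
    f[++n] = substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
  print f[1], f[4]   # client address and timestamp in the sample layout
}' Input_file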
This might work for you (GNU sed):
sed -E 'y/ /\n/
:a;s/^(\[[^]\n]*)\n/\1 /m;s/^("[^"\n]*)\n/\1 /m;ta
s/.*/echo '\''&'\'' | cat -n/e
s/^ *(\S)\t/$\1 = /mg' file
Replace all spaces by newlines.
Group all lines that begin and end in either [ and ] or double quotes, and replace the newlines by spaces.
Number all the lines.
Remove leading spaces and tabs and format the result.
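A rough alternative sketch of the same tokenize-then-number idea using GNU grep, if that is easier to reason about (shown for the first line only; it relies on grep -o preferring the longer bracketed/quoted alternatives, which is an assumption worth verifying on your system):

head -n 1 file | grep -oE '\[[^][]*\]|"[^"]*"|[^[:space:]]+' | cat -n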