This is an nginx access.log. It is delimited by 1) whitespace, 2) [ ], and 3) double quotes.
::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
After parsing, it is supposed to look like:
$1 = ::1
$4 = [12/Oct/2021:15:26:25 +0530] or 12/Oct/2021:15:26:25 +0530
$5 = "GET / HTTP/1.1"
$6 = 200
$7 = 1717
$8 = "-"
$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
I tried some options like
awk -F'[],] *'
awk -F'[][{}]'
but they don't work on the full line.
The nginx access.log shared here is just an example. I am trying to understand how to parse with a mix of such delimiters, for use with other complex logs.
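For reference, lines like the ones above follow nginx's predefined combined log format (assuming the default access_log configuration), which is defined roughly as:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

So the bracketed field is the timestamp and the quoted fields are the request line, referrer and user agent.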
For your Apache access logs (or any other web server that uses the Common or Combined log format) you can use the Apache Access Log Parser. It has been fine-tuned to correctly handle access logs, so you should use this instead of the generic parser to save yourself some time.
The apache-access-log parser of syslog-ng creates a new name-value pair for each field of the log message, and does some additional parsing to get more information. When you have generic columnar logs (for example, a list of tab-separated or comma-separated values), you can parse those using the CSV parser in syslog-ng.
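As a rough sketch only (the source path, parser names, and options here are assumptions and may need adjusting for your syslog-ng version), wiring up either parser looks something like this:

# assumed paths and names; adjust to your setup
source s_nginx { file("/var/log/nginx/access.log"); };

# Combined/Common log format: creates name-value pairs under the given prefix
parser p_access { apache-access-log-parser(prefix("access.")); };

# generic columnar data: for example three tab-separated values
parser p_cols { csv-parser(columns("col1", "col2", "col3") delimiters("\t")); };

destination d_out { file("/var/log/nginx-parsed.log"); };

log { source(s_nginx); parser(p_access); destination(d_out); };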
If you can use gnu-awk, you can make use of FPAT to specify the column data:
awk -v FPAT='\\[[^][]*]|"[^"]*"|\\S+' '{
  for (i = 1; i <= NF; i++) {
    print "$"i" = ", $i
  }
}' file
The pattern matches:

\\[[^][]*]   match from an opening [ till a closing ], using a negated character class
| "[^"]*"    or: match from an opening till a closing double quote
| \\S+       or: 1 or more non-whitespace chars

Output:
$1 = ::1
$2 = -
$3 = -
$4 = [12/Oct/2021:15:26:25 +0530]
$5 = "GET / HTTP/1.1"
$6 = 200
$7 = 1717
$8 = "-"
$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
Since these are nginx logs, their format will be consistent (or there are settings by which you can keep the format the same, at least in current versions). We can take advantage of this and concentrate on extracting only the parts we need, so I am using a regex here to pick out the matching values and skip the rest. This way we do not need to hardcode field numbers; the regex does the trick. This should work in any awk version.
awk '
{
  while (match($0, /^::[0-9]+|\[?[0-9]{1,2}\/[a-zA-Z]{3}\/[0-9]{4}(:[0-9]{2}){3}[[:space:]]+\+[0-9]{4}\]?|"[^"]*"|[[:space:]][0-9]{3}[[:space:]]|[0-9]+[[:space:]]/)) {
    val = substr($0, RSTART, RLENGTH)
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)
    print val
    $0 = substr($0, RSTART + RLENGTH)
  }
}' Input_file
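A minimal sketch of the same match-and-consume idea, collecting the pieces into an array so they can be addressed by position (the field positions used in the print are assumptions based on the sample line):

awk '
{
  split("", f); n = 0
  # repeatedly take the leftmost [...] group, "..." group or bare token, then consume it
  while (match($0, /\[[^][]*\]|"[^"]*"|[^[:space:]]+/)) {
    f[++n] = substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
  print f[1], f[4]   # client address and timestamp in the sample layout
}' Input_file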
This might work for you (GNU sed):
sed -E 'y/ /\n/
:a;s/^(\[[^]\n]*)\n/\1 /m;s/^("[^"\n]*)\n/\1 /m;ta
s/.*/echo '\''&'\'' | cat -n/e
s/^ *(\S)\t/$\1 = /mg' file
Replace all spaces by newlines.
Group all lines that begin and end in either [ and ] or double quotes, and replace the newlines by spaces.
Number all the lines.
Remove leading spaces and tabs and format the result.
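A rough alternative sketch of the same tokenize-then-number idea using GNU grep, if that is easier to reason about (shown for the first line only; it relies on grep -o preferring the longer bracketed/quoted alternatives, which is an assumption worth verifying on your system):

head -n 1 file | grep -oE '\[[^][]*\]|"[^"]*"|[^[:space:]]+' | cat -n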