Processing apache logs quickly

I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the 1000 ± 500 MB I expect it to write, and I wonder if I can process it much faster somehow.

Here is the awk script:

#!/bin/bash

awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1

EDIT:

For non-awkers: the script reads each line, extracts the date information, rewrites it into a format the date utility recognizes, and calls date to convert it to the number of seconds since 1970. It then prints that value, along with the IP, as a line of a .csv file.

Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"

Returned output: 189.5.56.113,124237889

@OP, your script is slow mainly because it calls the system date command once for every line in the file, and it's a big file as well (in the GB). If you have gawk, use its built-in mktime() function to do the date to epoch seconds conversion:

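awk 'BEGIN{
   # Map month abbreviations to two-digit month numbers (Jan -> 01, ...)
   m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
   for(o=1;o<=m;o++){
      date[d[o]]=sprintf("%02d",o)
    }
}
{
    # $4 is "[22/Jan/2010:05:54:55": drop "[" and turn ":" into "/"
    gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
    n=split($4, DATE,"/")
    day=DATE[1]
    mth=DATE[2]
    year=DATE[3]
    hr=DATE[4]
    min=DATE[5]
    sec=DATE[6]
    # mktime() expects "YYYY MM DD HH MM SS" and treats it as local time;
    # the offset in $5 (e.g. +0100) is not applied
    MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
    print $1,MKTIME

}' file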

Output:

$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh    
189.5.56.113 1264110895

If you really really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1 GB of Apache access logs in 1 or 2 seconds.

You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.

If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
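The reshaping step itself is language-independent, so here is a rough Python sketch of it, only to make the target format concrete. gawk's mktime() takes a string of the form "YYYY MM DD HH MM SS"; the helper name `to_mktime_spec` and the `MONTHS` table are made up for this illustration and are not part of gawk or of any answer here.

```python
# Month abbreviation -> two-digit month number (hypothetical lookup table).
MONTHS = {name: '%02d' % num for num, name in enumerate(
    'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split(), start=1)}


def to_mktime_spec(stamp):
    """Turn '22/Jan/2010:05:54:55' into '2010 01 22 05 54 55'."""
    day, month, rest = stamp.split('/')
    year, hour, minute, second = rest.split(':')
    return ' '.join([year, MONTHS[month], day, hour, minute, second])


print(to_mktime_spec('22/Jan/2010:05:54:55'))  # prints: 2010 01 22 05 54 55
```

In gawk the same rearrangement would be done with split() or gsub() before passing the string to mktime().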

This little Python script handles ~400MB worth of copies of your example line in about 3 minutes on my machine, producing ~200MB of output (keep in mind your sample line was quite short, so that's a handicap):

import time

# Read the access log and write one "ip,epoch" line per request.
with open('x.log', 'r') as src, open('x.csv', 'w') as dest:
    for line in src:
        ip = line[:line.index(' ')]
        # Slice out "22/Jan/2010:05:54:55", dropping the 6-character " +0100".
        date = line[line.index('[') + 1:line.index(']') - 6]
        # mktime() treats the parsed fields as local time.
        t = time.mktime(time.strptime(date, '%d/%b/%Y:%H:%M:%S'))
        dest.write('%s,%d\n' % (ip, int(t)))

A minor problem is that it ignores the timezone offset in the log line (a strptime() problem), but you could either hardcode the offset or add a little extra code to take care of it.
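One possible way to add that "little extra", sketched under the assumption that every timestamp looks like `22/Jan/2010:05:54:55 +0100` (a sign followed by four digits). The function name `log_time_to_epoch` is made up for this example. It parses the clock fields as if they were UTC with calendar.timegm() and then subtracts the offset by hand, so the result does not depend on the machine's local timezone.

```python
import calendar
import time


def log_time_to_epoch(stamp):
    """Convert '22/Jan/2010:05:54:55 +0100' to seconds since 1970 (UTC)."""
    date_part, offset = stamp.split(' ')
    # Parse only the wall-clock fields; the offset is handled separately.
    fields = time.strptime(date_part, '%d/%b/%Y:%H:%M:%S')
    # timegm() interprets the fields as UTC, unlike time.mktime().
    as_if_utc = calendar.timegm(fields)
    # Offset is +HHMM or -HHMM. Log time = UTC + offset, so subtract it.
    sign = -1 if offset[0] == '-' else 1
    offset_seconds = sign * (int(offset[1:3]) * 3600 + int(offset[3:5]) * 60)
    return as_if_utc - offset_seconds


line = '189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"'
ip = line[:line.index(' ')]
stamp = line[line.index('[') + 1:line.index(']')]
print('%s,%d' % (ip, log_time_to_epoch(stamp)))  # prints: 189.5.56.113,1264136095
```

Note that `%b` matches month names in the current locale, so this assumes an English or C locale.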

But to be honest, something as simple as that should be just as easy to rewrite in C.

Tags: apache, awk, large-data-volumes

Asked by konr. 4 Answers: ghostdog74, Dietrich Epp, Dennis Williamson, Max Shawabkeh
