I have scripts that process big log files. I can read everything from a given line onward and do something with it, using either tail or awk.
Tail:
tail -n +"$startline" "$LOG"
Awk:
awk 'NR>='"$startline"' {print}' "$LOG"
Timing them, tail takes 6 minutes 39 seconds and awk takes 6 minutes 42 seconds, so the two commands do the same thing in about the same time.
I don't know how to do this with sed. Could sed be faster than tail and awk? Or maybe some other command?
Second question: I use $startline so that each run continues from where the last one stopped. For example:
10:00AM -> ./script -> $startline=1, do something, write the last line number to a save file (e.g. 25),
10:05AM -> ./script -> $startline=26 (save file + 1), do something, write the line number to the save file (55),
10:10AM -> ./script -> $startline=56 (save file + 1), do something ...
But while the script runs, it still reads every line from the beginning and only starts doing the work once it reaches $startline, which is a little slow because the files are huge.
Any suggestions to make this faster?
Script example:
lastline=$(tail -1 line.save)          # last line number processed by the previous run
startline=$((lastline + 1))
tail -n +"$startline" "$LOG" | while read -r line
do
....
done
linecount=$(wc -l < "$LOG")            # total lines now in the log, without needing awk to strip the filename
echo "$linecount" >> line.save
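A possible tightening (my sketch, not from the post; it assumes bash): count lines inside the loop, so the save file records exactly what was processed even if the log grows while the script runs. Process substitution is used instead of a pipe so the counter survives the loop, and the save file is overwritten rather than appended, since only the latest number is needed:
count=$(tail -1 line.save)                  # last line number processed by the previous run
startline=$((count + 1))
while read -r line
do
    count=$((count + 1))                    # track lines as they are actually processed
    # ... do something with "$line" ...
done < <(tail -n +"$startline" "$LOG")      # bash process substitution: count keeps its value
echo "$count" > line.save                   # overwrite: only the latest number matters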
I find awk much faster than sed. You can also speed up grep if you don't need real regular expressions but only simple fixed strings (option -F). If you want to chain grep, sed, and awk together in a pipe, I would place the grep command first if possible, as in the example below.
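For instance (a hypothetical pipeline; ERROR is just an assumed fixed string), letting grep -F discard non-matching lines first means awk only has to process the survivors:
grep -F 'ERROR' "$LOG" | awk '{print $1, $2}'    # print the first two fields of the matching lines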
Conclusion: use sed for very simple text parsing. For anything beyond that, awk is better. In fact, since their functions overlap and awk can do more, you can ditch sed altogether and just use awk.
Where either sed or tr can do the job, prefer the tr command, because tr is faster. Of course, in many practical cases the speed difference is too small to notice.
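As an illustration (my example, not from the original answer): stripping carriage returns can be done either way, and the tr version avoids the regex machinery entirely:
tr -d '\r' < "$LOG"       # delete every CR byte in the stream
sed 's/\r$//' "$LOG"      # roughly equivalent for CRs at line ends (GNU sed understands \r)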
Awk compiles your script once and then applies it to every line of your file at C-like speeds; it is much faster than doing the same loop in Python. If you learn to use awk well, you will start doing things with data that you wouldn't have had the patience to do in an interpreted language.
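As an illustration of the kind of one-pass job this makes cheap (my example, not from the answer), summing the second column of a large file:
awk '{ sum += $2 } END { print sum }' "$LOG"     # one pass over the file, total printed at the end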
tail and head are tools created especially for this purpose, so the intuitive expectation is that they are well optimized for it. On the other hand, awk and sed can perfectly well do it, because they are like a Swiss Army knife; but this is not supposed to be their best skill among the many others that they have.
In Efficient way to print lines from a massive file using awk, sed, or something else? there is a nice comparison of methods, and head / tail comes out as the best approach.
Hence, I would go for tail + head.
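Concretely, to print the range from $startline up to a hypothetical upper bound $endline, tail skips straight to the start and head stops reading as soon as the range ends:
tail -n +"$startline" "$LOG" | head -n $((endline - startline + 1))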
Note also that if you want not just the last lines but a set of lines somewhere within the file, in awk (or in sed) you have the option to exit after the last line you wanted. This way, you avoid running through the file to its last line.
So this:
awk '{if (NR>=10 && NR<20) print} NR==20 {print; exit}'
is faster than
awk 'NR>=10 && NR<=20'
if your input happens to contain more than 20 lines.
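The sed counterpart of that early exit (my equivalent, using sed's q command to quit) would be:
sed -n '10,20p; 20q' "$LOG"     # print lines 10-20, then quit instead of reading the rest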
Regarding your expression:
awk 'NR>='"$startline"' {print}' "$LOG"
note that it is more straightforward to write:
awk -v start="$startline" 'NR>=start' "$LOG"
There is no need to say print, because it is implicit.
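Combining that -v style with the early exit shown above (start and end being the assumed bounds), a range print could look like:
awk -v start="$startline" -v end="$endline" 'NR>=start; NR==end {exit}' "$LOG"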