
Awk, tail, sed or others - which one faster for big files?

I have scripts that process big log files. I can read a file from a given line onward with either tail or awk.

Tail:

tail -n +"$startline" "$LOG"

Awk:

awk 'NR>='"$startline"' {print}' $LOG

I timed both: tail took 6 minutes 39 seconds and awk took 6 minutes 42 seconds. So the two commands do the same thing in about the same time.

I don't know how to do this with sed. Can sed be faster than tail and awk? Or is there some other command that would be?


Second question: I use $startline so that each run continues from where the previous run stopped. The script is used like this:

10:00AM -> ./script -> $startline=1, do something -> write the line number to the save file (e.g. 25),
10:05AM -> ./script -> $startline=26 (saved value + 1), do something -> write the line number to the save file (55),
10:10AM -> ./script -> $startline=56 (saved value + 1), do something ....

But on every run the script still reads through all the lines from the start of the file, and only begins doing something once it reaches $startline. That is a little slow because the files are huge.

Any suggestions to make it faster?

Script example:

lastline=$(tail -n 1 "line.save")
startline=$((lastline + 1))
tail -n +"$startline" "$LOG" | while read -r
do
....
done
linecount=$(wc -l < "$LOG")
echo "$linecount" >> line.save
onur asked Nov 21 '14 08:11



1 Answer

tail and head were created specifically for this purpose, so the intuitive expectation is that they are well optimized for it. awk and sed can certainly do the job too, because they are like a Swiss Army knife, but this is not meant to be their strongest skill among the many they have.

The question "Efficient way to print lines from a massive file using awk, sed, or something else?" has a nice comparison of methods, and head / tail comes out as the best approach.

Hence, I would go for tail + head.
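
As a rough illustration of the tail + head combination, here is a minimal sketch assuming a POSIX shell. The function name print_range and its arguments are hypothetical, chosen only for this example:

```shell
#!/bin/sh
# print_range FILE FIRST LAST
# Print lines FIRST..LAST (1-based, inclusive) of FILE using tail + head.
# print_range is a hypothetical helper name, not part of the original script.
print_range() {
    file=$1
    first=$2
    last=$3
    # tail starts its output at line FIRST; head stops after
    # LAST - FIRST + 1 lines, which ends the pipeline early.
    tail -n +"$first" "$file" | head -n "$((last - first + 1))"
}
```

For instance, print_range "$LOG" 10 20 would print lines 10 through 20 of the log.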


Note also that if you want not only the last lines but a set of lines in the middle of the text, awk (or sed) lets you exit right after the last line you want. That way the script does not have to read the file all the way to the end.

So this:

awk '{if (NR>=10 && NR<20) print} NR==20 {print; exit}'

is faster than

awk 'NR>=10 && NR<=20'

if your input happens to contain more than 20 lines.
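
The same early exit is available in sed through its q command. The following is only a sketch, assuming a POSIX shell; range_awk and range_sed are hypothetical helper names, and the awk version is a shorter spelling of the early-exit command above:

```shell
#!/bin/sh
# range_awk FILE FIRST LAST
# Print lines FIRST..LAST with awk and stop reading at line LAST.
# (Hypothetical helper name, for illustration only.)
range_awk() {
    # The first rule prints every line from FIRST on (default action);
    # the second rule exits once line LAST has been printed.
    awk -v first="$2" -v last="$3" 'NR >= first; NR == last { exit }' "$1"
}

# range_sed FILE FIRST LAST
# Same idea with sed: -n suppresses default output, "p" prints the
# range, and "q" quits at line LAST.
# (Hypothetical helper name, for illustration only.)
range_sed() {
    sed -n -e "${2},${3}p" -e "${3}q" "$1"
}
```

Both helpers should give the same output for the same arguments; which one is faster on a particular file is something to measure rather than assume.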


Regarding your expression:

awk 'NR>='"$startline"' {print}' $LOG

note that it is more straightforward to write:

awk -v start="$startline" 'NR>=start' $LOG

There is no need to write print, because printing is the default action when a pattern has no action block.
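
To make the -v form easier to reuse, it can be wrapped in a small function. This is a minimal sketch assuming a POSIX shell; the name lines_from and the default of 1 for a missing start line are assumptions made for this example, not something from the original script:

```shell
#!/bin/sh
# lines_from FILE [START]
# Print FILE from line START (1-based) to the end; START defaults to 1.
# lines_from is a hypothetical helper name used only for this illustration.
lines_from() {
    # -v hands the shell value to awk as the awk variable "start",
    # so the awk program stays inside a single pair of single quotes.
    awk -v start="${2:-1}" 'NR >= start' "$1"
}
```

With this in place, lines_from "$LOG" "$startline" does the same job as the awk command above.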

fedorqui 'SO stop harming' answered Sep 24 '22 12:09