 

Fastest way to print a single line in a file


I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files, and I was asking myself what the best option would be (in terms of performance). There are many ways to do this; I mainly use these two:

cat ${file} | head -1 

or

cat ${file} | sed -n '1p' 

I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) read the whole file first and then fetch line 1?

JBoy, asked Mar 26 '13




2 Answers

Drop the useless use of cat and do:

$ sed -n '1{p;q}' file 

This will quit the sed script after the line has been printed.
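The same early-quit idea generalizes to an arbitrary line N. A small sketch (the sample file and line number are illustrative):

```shell
# create a sample file with 1,000,000 numbered lines
seq 1 1000000 > /tmp/bigfile

# print line 5 and quit immediately (-n suppresses sed's automatic printing)
sed -n '5{p;q}' /tmp/bigfile

# equivalent idiom: delete lines 1-4, then q prints line 5 and quits
sed '5q;d' /tmp/bigfile
```

Both commands stop reading the file as soon as the target line is reached, so their cost does not depend on the file's total length.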


Benchmarking script:

#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')

# files up to a hundred million lines (decrease if you're on a slow machine!)
for (( j=1; j<=100000000; j=j*10 ))
do
    echo "Lines in file: $j"
    # create file containing j lines
    seq 1 $j > file
    # initial read of file
    cat file > /dev/null

    for comm in {0..3}
    do
        avg=0
        echo
        echo ${heading[$comm]}
        for (( i=1; i<=$n; i++ ))
        do
            case $comm in
                0)
                    t=$( { time head -1 file > /dev/null; } 2>&1);;
                1)
                    t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                2)
                    t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                3)
                    t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
            esac
            avg=$avg+$t
        done
        echo "scale=3;($avg)/$n" | bc
    done
done

Just save as benchmark.sh and run bash benchmark.sh.

Results:

head -1 file
.001

sed -n 1p file
.048

sed -n '1{p;q}' file
.002

read line < file && echo $line
0

Results from a file with 1,000,000 lines.

So the times for sed -n 1p grow linearly with the length of the file, while the timings for the other variations are constant (and negligible), since they all quit after reading the first line:

[Plot: runtime vs. number of lines in the file for each command]

Note: timings are different from original post due to being on a faster Linux box.

13 revs, 2 users 96%, answered Nov 08 '22


If you really are just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a builtin in bash and ksh. This eliminates the overhead of process creation incurred by awk, sed, head, etc.
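A minimal sketch of the builtin approach (the sample file name is made up; IFS= and -r keep whitespace and backslashes in the line intact):

```shell
# create a sample file with 1,000,000 numbered lines
seq 1 1000000 > /tmp/bigfile

# read the first line with the shell builtin -- no child process is forked
IFS= read -r line < /tmp/bigfile
printf '%s\n' "$line"
```

The redirection opens the file, read consumes one line, and the shell closes it again, all without exec'ing an external program.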

The other issue is doing timed performance analysis on I/O. The first time you open and read a file, its data is probably not cached in memory. However, if you run a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever (they do on Solaris, for example; or at any rate, for several days).

For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.

All of this caching effect "interference" is both OS and hardware dependent.

So: pick one file and read it with a command. Now it is cached. Run the same test command several dozen times; this samples the cost of the command and child-process creation, not your I/O hardware.
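A sketch of that methodology (file name and iteration count are illustrative; the first cat warms the page cache so the timed loop measures command overhead, not disk I/O):

```shell
# sample file
seq 1 1000000 > /tmp/bigfile

# warm the cache with one full read
cat /tmp/bigfile > /dev/null

# now time many repetitions of the command under test
time for i in $(seq 1 50); do
    sed -n '1{p;q}' /tmp/bigfile > /dev/null
done
```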

This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:

sed: sed '1{p;q}' uopgenl20121216.lis

real    0m0.917s
user    0m0.258s
sys     0m0.492s

read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"

real    0m0.017s
user    0m0.000s
sys     0m0.015s

This is clearly contrived, but does show the difference between builtin performance vs using a command.
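Tying this back to the question's loop over multiple files, a sketch of the builtin approach at scale (file names are made up):

```shell
# create a few sample files, each starting at a different number
for i in 1 2 3; do
    seq $i 1000 > /tmp/sample_$i.txt
done

# grab the first line of each file with the builtin:
# one open, one read per file, and no forked child processes
for f in /tmp/sample_*.txt; do
    IFS= read -r first < "$f"
    printf '%s: %s\n' "$f" "$first"
done
```

With hundreds of files, avoiding one fork+exec per file is exactly where the builtin's advantage adds up.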

jim mcnamara, answered Nov 08 '22