I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files, and I was asking myself what the best option would be (in terms of performance). There are many ways to do this; I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) first read the whole file and then fetch row 1?
The default command that comes to mind is head: head with the option -1 displays the first line. The best of all the options, however, is one that uses a shell builtin (an internal command), as discussed further below.
Another option is sed 'NUMq;d' file, where NUM is the number of the line you want to print; so, for example, sed '10q;d' file prints the 10th line of file.
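For instance, applied to the original use case of looping over multiple files, a minimal sketch might look like this (the glob pattern and the target line number 1500 are placeholders, not from the question):

num=1500
for file in data/*.txt; do
    # 'q' makes sed quit right after printing line $num, so the rest
    # of a huge file is never read.
    sed "${num}q;d" "$file"
done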
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
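As an aside (not part of the answer above), the same quit-early pattern generalizes to any line number, and an equivalent awk one-liner exits after the target line as well; line 42 is just an arbitrary example:

# print line 42 and stop reading the file immediately afterwards
sed -n '42{p;q}' file
# roughly equivalent awk version, also exiting after the matching line
awk 'NR==42 {print; exit}' file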
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')

# files up to a hundred million lines (if you're on a slow machine, decrease!!)
for (( j=1; j<=100000000; j=j*10 ))
do
    echo "Lines in file: $j"

    # create file containing j lines
    seq 1 $j > file

    # initial read of file
    cat file > /dev/null

    for comm in {0..3}
    do
        avg=0
        echo
        echo ${heading[$comm]}
        for (( i=1; i<=$n; i++ ))
        do
            case $comm in
                0) t=$( { time head -1 file > /dev/null; } 2>&1);;
                1) t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                2) t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                3) t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
            esac
            avg=$avg+$t
        done
        echo "scale=3;($avg)/$n" | bc
    done
done
Just save it as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file                      .001
sed -n 1p file                    .048
sed -n '1{p;q}' file              .002
read line < file && echo $line    0
*Results from a file with 1,000,000 lines.*
So the times for sed -n 1p will grow linearly with the length of the file, but the timing for the other variations will be constant (and negligible), as they all quit after reading the first line.
Note: timings are different from original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands; use read, which is a shell builtin for bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
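As a rough sketch of that approach (the glob pattern here is a placeholder), the first line of each file can be captured entirely within the shell:

for file in logs/*.log; do
    # IFS= and -r keep the line verbatim; no external process is forked
    IFS= read -r first_line < "$file"
    printf '%s: %s\n' "$file" "$first_line"
done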
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, its data is probably not cached in memory. However, if you run a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever (they do on Solaris, for example, or at least for several days).
For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So: pick one file and read it with a command; now it is cached. Then run the same test command several dozen times. This samples the effect of the command and child-process creation, not your I/O hardware.
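A minimal sketch of that sampling procedure (file name and iteration count are arbitrary) could look like:

file=bigfile.txt
cat "$file" > /dev/null                  # first read: pull the file into the page cache
time for i in {1..50}; do                # repeated runs now mostly measure process
    sed -n '1{p;q}' "$file" > /dev/null  # creation, not disk I/O
done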
Here is sed vs read for 10 iterations of getting the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis

real    0m0.917s
user    0m0.258s
sys     0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"

real    0m0.017s
user    0m0.000s
sys     0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
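One plausible way to reproduce such a comparison (the 10-iteration loop structure is an assumption; only the file name comes from the answer above) is:

# time 10 sed invocations vs 10 uses of the read builtin on a cached file
time for i in {1..10}; do sed '1{p;q}' uopgenl20121216.lis > /dev/null; done
time for i in {1..10}; do IFS= read -r foo < uopgenl20121216.lis; echo "$foo" > /dev/null; done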