I am wondering if there is a more efficient way to get this done. I am working with files ranging from a couple hundred thousand to a couple million lines. Say I know that lines 100,000-125,000 contain the data I am looking for: is there a quick way to pull just those lines from the file? Right now I am using a loop with grep like this:
for ((i=$start_fid; i<=$end_fid; i++))
do
grep "^$i " fulldbdir_new >> new_dbdir${bscnt}
done
This works fine, it's just taking longer than I would like. The lines contain more than just numbers: each line has about 10 fields, with the first being a sequential integer that appears only once per file.
I am comfortable writing in C if necessary.
sed can do the job:
sed -n '100000,125000p' input
EDIT: As per glenn jackman's suggestion, it can be adjusted for efficiency so sed stops reading as soon as the range has been printed:
sed -n '100000,125000p; 125001q' input
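In the asker's terms this replaces the whole grep loop with a single pass (a sketch only, reusing the file names from the question and assuming $start_fid and $end_fid really are line numbers):

sed -n "${start_fid},${end_fid}p; $((end_fid + 1))q" fulldbdir_new > new_dbdir${bscnt}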
You can try a combination of head and tail to get the correct lines:
head -n 125000 file_name | tail -n 25001 | grep "^$i "
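Here head -n 125000 keeps the first 125,000 lines and tail -n 25001 keeps the last 125000 - 100000 + 1 = 25001 of those, i.e. lines 100,000 through 125,000. With the asker's variables the count can be computed instead of hard-coded (a sketch, assuming they hold the first and last line numbers):

head -n "$end_fid" file_name | tail -n "$((end_fid - start_fid + 1))"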
Don't forget perl either.
perl -ne 'print if $. >= 100000 && $. <= 125000' file_name | grep "^$i "
or some faster perl that stops reading once it has printed the range:
perl -ne 'print if $. >= 100000; exit if $. >= 125000' file_name | grep "^$i "
Also, instead of a for loop you might want to look into using GNU parallel.
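For example (a sketch only; this still runs one full grep pass per id, so it spreads the work across cores rather than reducing it, and -k keeps the output in id order):

seq "$start_fid" "$end_fid" | parallel -k "grep '^{} ' fulldbdir_new" > new_dbdir${bscnt}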
I'd use awk:
awk 'NR >= 100000; NR == 125000 {exit}' file
For big numbers you can also use E notation:
awk 'NR >= 1e5; NR == 1.25e5 {exit}' file
EDIT: @glenn jackman's suggestion (cf. comment)
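The line numbers can also be passed in from the shell with -v rather than hard-coded (a sketch using the variable names from the question):

awk -v start="$start_fid" -v end="$end_fid" 'NR >= start; NR == end {exit}' file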