I am wondering if there is a more efficient way to get this done. I am working with files ranging from a couple hundred thousand to a couple million lines. Say I know that lines 100,000-125,000 contain the data I am looking for: is there a quick way to pull just those lines from the file? Right now I am using a loop with grep like this:
for ((i=$start_fid; i<=$end_fid; i++))
do
grep "^$i " fulldbdir_new >> new_dbdir${bscnt}
done
This works fine, it's just taking longer than I would like. The lines contain more than just numbers: each line has about 10 fields, with the first being a sequential integer that appears only once per file.
I am comfortable writing in C if necessary.
sed can do the job:
sed -n '100000,125000p' input
EDIT: As per glenn jackman's suggestion, it can be adjusted for efficiency so sed stops reading as soon as the range has been printed:
sed -n '100000,125000p; 125001q' input
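In the asker's terms this replaces the whole grep loop with a single pass (a sketch only, reusing the file names from the question and assuming $start_fid and $end_fid really are line numbers):

sed -n "${start_fid},${end_fid}p; $((end_fid + 1))q" fulldbdir_new > new_dbdir${bscnt}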
You can try a combination of head and tail to get the correct lines:
head -n 125000 file_name | tail -n 25001 | grep "^$i "
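Here head -n 125000 keeps the first 125,000 lines and tail -n 25001 keeps the last 125000 - 100000 + 1 = 25001 of those, i.e. lines 100,000 through 125,000. With the asker's variables the count can be computed instead of hard-coded (a sketch, assuming they hold the first and last line numbers):

head -n "$end_fid" file_name | tail -n "$((end_fid - start_fid + 1))"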
Don't forget perl either.
perl -ne 'print if $. >= 100000 && $. <= 125000' file_name | grep "^$i "
or some faster perl that stops reading once it has printed the range:
perl -ne 'print if $. >= 100000; exit if $. >= 125000' file_name | grep "^$i "
Also, instead of a for loop you might want to look into using GNU parallel.
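For example (a sketch only; this still runs one full grep pass per id, so it spreads the work across cores rather than reducing it, and -k keeps the output in id order):

seq "$start_fid" "$end_fid" | parallel -k "grep '^{} ' fulldbdir_new" > new_dbdir${bscnt}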
I'd use awk:
awk 'NR >= 100000; NR == 125000 {exit}' file
For big numbers you can also use E notation:
awk 'NR >= 1e5; NR == 1.25e5 {exit}' file
EDIT: @glenn jackman's suggestion (cf. comment)
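The line numbers can also be passed in from the shell with -v rather than hard-coded (a sketch using the variable names from the question):

awk -v start="$start_fid" -v end="$end_fid" 'NR >= start; NR == end {exit}' file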