Why is grep so slow and memory intensive with -w (--word-regexp) flag?

Tags: grep, bash, shell, unix, awk

I have a list of ids in a file and a data file (~3.2 GB in size), and I want to extract the lines in the data file that contain an id, along with the line that follows each of them. I did the following:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

This worked, but it also matched unwanted substrings: for example, if the id is EA4, it also pulled out the lines with EA40.

So I tried the same command with the -w (--word-regexp) flag added to the first grep to match whole words. However, the command now ran for more than an hour (rather than ~26 seconds) and started using tens of gigabytes of memory, so I had to kill the job.

Why did adding -w make the command so slow and memory-hungry? How can I run this efficiently to get my desired output? Thank you

file.ids looks like this:

>EA4
>EA9

file.data looks like this:

>EA4 text
data
>E40 blah
more_data
>EA9 text_again
data_here

output.data would look like this:

>EA4 text
data
>EA9 text_again
data_here

asked Oct 06 '16 10:10

Chris_Rands

1 Answer

grep -F string file simply looks for occurrences of string in the file, but grep -w -F string file also has to check the characters immediately before and after each occurrence of string to see whether they are word characters. That's a lot of extra work. One possible implementation would be to first split each line into every possible non-word-character-delimited string, overlaps included, which could take up a lot of memory, but I don't know whether that's what is causing your memory usage.

In any case, grep is simply the wrong tool for this job: you only want to match against a specific field in the input file, so you should be using awk instead:
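To see the difference in matching behavior (not the performance problem) on a small scale, here is a minimal sketch. It assumes GNU grep, and the two-line sample file is made up for illustration; it only counts how many lines each variant matches.

```shell
# Tiny stand-ins for file.ids and file.data (hypothetical sample content).
tmp=$(mktemp -d)
printf '%s\n' '>EA4' > "$tmp/file.ids"
printf '%s\n' '>EA4 text' '>EA40 other' > "$tmp/file.data"

# Plain fixed-string match: ">EA4" is a substring of ">EA40", so both lines count.
substring_hits=$(grep -c -Ff "$tmp/file.ids" "$tmp/file.data")

# Whole-word match: the character right after the match must not be a
# word character, so ">EA40" is rejected.
word_hits=$(grep -c -w -Ff "$tmp/file.ids" "$tmp/file.data")

echo "substring: $substring_hits, whole-word: $word_hits"
rm -rf "$tmp"
```

The extra boundary check is cheap on two lines; the question is about what happens when it is repeated for many ids across a multi-gigabyte file.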

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
data
>EA9 text_again
data_here

The above assumes your "data" lines cannot start with >. If they can, then tell us how to identify data lines vs id lines.

Note that the above will work no matter how many data lines you have between id lines, whether there are 0 or 100:

$ cat file.data
>EA4 text
>E40 blah
more_data
>EA9 text_again
data 1
data 2
data 3

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
>EA9 text_again
data 1
data 2
data 3
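If the one-liner is hard to read, it can be spread over several lines with comments. This is the same logic as above, just reformatted; the flag is renamed from f to keep for readability, and the temporary sample files are stand-ins for your real file.ids and file.data.

```shell
# Hypothetical sample files standing in for the real inputs.
tmp=$(mktemp -d)
printf '%s\n' '>EA4' '>EA9' > "$tmp/file.ids"
printf '%s\n' '>EA4 text' 'data' '>EA40 blah' 'more_data' \
              '>EA9 text_again' 'data_here' > "$tmp/file.data"

result=$(awk '
    # First file (file.ids): store every line as a key of the ids array.
    NR == FNR { ids[$0]; next }
    # Header line in the data file: set the flag only if the first
    # whitespace-separated field is exactly one of the stored ids.
    /^>/ { keep = ($1 in ids) }
    # Print the current line while the flag is set.
    keep
' "$tmp/file.ids" "$tmp/file.data")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the lookup is an exact comparison of the first field against an array key, >EA40 can never match >EA4, and each data line costs one hash lookup at most regardless of how many ids there are.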

Also, you don't need to pipe the output to grep -v:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

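Just do it all in the one script. awk never prints the "--" group separators that grep -A emits, so the extra !/^-/ test only matters if you also want to drop data lines that themselves start with -: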

awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data
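If you really do want exactly one line after each matching id line (the literal equivalent of grep -A1) rather than every line up to the next id line, a counter can replace the flag. This is a sketch under that assumption; the variable name left and the sample files are made up for illustration.

```shell
# Hypothetical sample files; the first record has two data lines.
tmp=$(mktemp -d)
printf '%s\n' '>EA4' '>EA9' > "$tmp/file.ids"
printf '%s\n' '>EA4 text' 'data 1' 'data 2' '>EA40 blah' 'more_data' \
              '>EA9 text_again' 'data_here' > "$tmp/file.data"

one_after=$(awk '
    # First file: remember every id line as an array key.
    NR == FNR { ids[$0]; next }
    # Matching header: print this line and the one after it.
    /^>/ && ($1 in ids) { left = 2 }
    # Print while lines remain to be printed, counting down.
    left > 0 { print; left-- }
' "$tmp/file.ids" "$tmp/file.data")

printf '%s\n' "$one_after"
rm -rf "$tmp"
```

With this sample, "data 2" is skipped because only one line after the >EA4 header is printed.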

answered Sep 28 '22 06:09

Ed Morton
