
Why is grep so slow and memory intensive with -w (--word-regexp) flag?

I have a list of IDs in a file and a data file (~3.2 GB in size), and I want to extract the lines in the data file that contain an ID, plus the line that follows each match. I did the following:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

This worked, but it also matched unwanted substrings; for example, if the ID is EA4 it also pulled out the lines containing EA40.

So I tried the same command but added the -w (--word-regexp) flag to the first grep to match whole words only. However, the command now ran for over an hour (rather than ~26 seconds) and started using tens of gigabytes of memory, so I had to kill the job.

Why did adding -w make the command so slow and memory hungry? How can I run this command efficiently to get my desired output? Thank you.

file.ids looks like this:

>EA4
>EA9

file.data looks like this:

>EA4 text
data
>EA40 blah
more_data
>EA9 text_again
data_here

output.data would look like this:

>EA4 text
data
>EA9 text_again
data_here
asked Oct 06 '16 by Chris_Rands


1 Answer

grep -F string file simply looks for occurrences of string in the file, but grep -w -F string file also has to check the characters immediately before and after each match to see whether they are word characters. That's a lot of extra work. One possible implementation would be to first split each line into every possible non-word-character-delimited substring (with overlaps, of course), which could consume a lot of memory, but I don't know if that's what's causing your memory usage.
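As a quick sketch of what -w changes (using printf to stand in for your files):

$ printf '>EA4 text\n>EA40 blah\n' | grep -F '>EA4'
>EA4 text
>EA40 blah
$ printf '>EA4 text\n>EA40 blah\n' | grep -wF '>EA4'
>EA4 text

Without -w the fixed string >EA4 matches inside >EA40; with -w the trailing 0 is a word character, so that line is rejected.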

In any case, grep is simply the wrong tool for this job: since you only want to match against a specific field in the input file, you should be using awk instead:

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
data
>EA9 text_again
data_here
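That's the same script spread out, with a comment per step:

awk '
    NR==FNR { ids[$0]; next }   # first file (file.ids): store each line as an array key
    /^>/    { f = ($1 in ids) } # id line in file.data: set flag if field 1 is a wanted id
    f                           # while the flag is set, print the current line
' file.ids file.data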

The above assumes your "data" lines cannot start with >. If they can, then tell us how to identify data lines vs. id lines.

Note that the above will work no matter how many data lines you have between id lines, even if there are 0 or 100:

$ cat file.data
>EA4 text
>EA40 blah
more_data
>EA9 text_again
data 1
data 2
data 3

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
>EA9 text_again
data 1
data 2
data 3

Also, you don't need to pipe the output to grep -v (that was only needed to strip the "--" group separators that grep -A1 prints between matches):

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

just do it all in one script:

awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data
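To write the result to output.data as your original pipeline did, just add a redirect:

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data > output.data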
answered Sep 28 '22 by Ed Morton