
How to use "grep -f file" if "file" has null-delimited items?

Tags: grep, bash, comm

I need to find the null-delimited items in numerous files (data2, data3, ...) that are also present in data1. An exact match is required.

Everything works well with grep -f data1 data2 data3 ... until the items in data1 are also null-delimited.

  1. Using only newlines - ok:

    $ cat data1
    1234
    abcd
    efgh
    5678
    $ cat data2
    1111
    oooo
    abcd
    5678
    $ grep -xFf data1 data2
    abcd
    5678
    
  2. data2 contains null-delimited items - ok when -z used:

    $ printf '1111\0oooo\0abcd\0005678' > data2
    $ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
    abcd
    5678
    
  3. Now both data1 and data2 contain null-delimited items - fails. It seems that the -z option does not apply to the file specified with -f:

    $ printf '1234\0abcd\0efgh\0005678' > data1
    $ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
    
    $
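
A quick way to check that the pattern file is the culprit is to translate data1 to newline-delimited form on the fly and hand it to grep on standard input with -f -. This is only a sketch: it assumes GNU grep and that no item contains a newline, and the scratch directory and the `matches` variable are illustrative names, not part of the original setup.

```shell
# Sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\0' 1234 abcd efgh 5678 > "$dir/data1"
printf '%s\0' 1111 oooo abcd 5678 > "$dir/data2"

# grep -f expects newline-separated patterns, so translate NUL to newline
# for the pattern list only; data2 is still read as NUL-delimited via -z.
# Assumption: no item in data1 contains a newline.
matches=$(tr '\0' '\n' < "$dir/data1" |
    grep -zxFf - "$dir/data2" |
    tr '\0' '\n')
printf '%s\n' "$matches"
```

This stops working as soon as an item contains a newline, which is usually the reason for choosing NUL delimiters in the first place.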
    

The problem is that I do need both files to have null-delimited items. An obvious workaround would be (for example) a good old while loop:

    # read -d '' splits on NUL; "|| [[ $line ]]" keeps a final item
    # that has no trailing NUL.
    while IFS= read -rd '' line || [[ $line ]]; do
        if grep -zqxF "$line" data2; then
            printf '%s\n' "$line"
        fi
    done < data1

But since I have many files with lots of items, this will be painfully slow! Is there a better approach (I do not insist on using grep)?

PesaThe asked Aug 28 '18 15:08


1 Answer

Since order retention isn't important, you're matching exact strings, and you have GNU tools available, I'd suggest comm -z instead of fgrep.

    $ printf '%s\0' 1111 oooo abcd 005678 >data2
    $ printf '%s\0' 1234 abcd efgh 005678 >data
    $ comm -z12 <(sort -uz <data) <(sort -uz <data2) | xargs -0 printf '%s\n'
    005678
    abcd

If you generate your files sorted in the first place (and thus can leave out the sort operations), this will also have very good memory and performance characteristics.
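
For the many-files case in the question, the same idea could be wrapped so that the reference list is sorted only once. The following is a rough sketch, not a tested production script: `common_items` is a hypothetical helper name, the sample files are made up, and GNU sort and comm are assumed. LC_ALL=C is set on both tools so that they agree on the sort order.

```shell
# Hypothetical helper: for each listed file, print the items that also
# occur in the reference file. All files are assumed to be NUL-delimited.
common_items() {
    ref=$1; shift
    sorted_ref=$(mktemp)
    # Sort the reference list once and reuse it for every comparison.
    LC_ALL=C sort -uz < "$ref" > "$sorted_ref"
    for f in "$@"; do
        # comm -12 keeps only the items present in both inputs;
        # "-" makes comm read the second input from standard input.
        LC_ALL=C sort -uz < "$f" |
            LC_ALL=C comm -z12 "$sorted_ref" - |
            tr '\0' '\n'
    done
    rm -f "$sorted_ref"
}

# Made-up sample data in a scratch directory.
dir=$(mktemp -d)
printf '%s\0' 1234 abcd efgh 5678 > "$dir/data1"
printf '%s\0' 1111 oooo abcd 5678 > "$dir/data2"
printf '%s\0' efgh zzzz           > "$dir/data3"

common_items "$dir/data1" "$dir/data2" "$dir/data3"
```

The final tr is only there to make the result readable on a terminal; drop it if the output needs to stay NUL-delimited for further processing.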

Charles Duffy answered Sep 22 '22 12:09