Is there an inverse of grep: finding short lines in long patterns?

grep finds lines of a look-up file that contain a given short pattern; I need the inverse: a tool that extracts the short lines of a lookup file that can be found within a longer pattern line.

In other words, given the works of Shakespeare with one sentence per line and, say, a French dictionary, I want to find which French words appear in which lines of Shakespeare, allowing for the fact that a line of Shakespeare may contain more than one French word and that a French word may appear in more than one line of Shakespeare.

For example:

pattern_file={
"The sun is shining!"
"It is a beautiful day!"}

lookup_file={
"Rain"
"Sun"
"Cloud"
"Beautiful"
"Shining"}

What I would like is

function file pattern

To give both the line that is found in the longer pattern and the longer pattern itself, separated by a comma, with multiple matches being detected.

ideal_result_file={
"Sun","The sun is shining!"
"Beautiful","It is a beautiful day!",
"Shining", "The sun is shining!"}

Currently, I loop over the whole lookup file line by line with grep:

    while read -r line
    do
      grep -is "$line" pattern_file | sed "s/^/$line,/" >> result_file.csv
    done < lookup_file

This is incredibly slow! My lookup_file contains over 50,000 lines while my pattern_file contains 500. Whereas grep takes seconds to find an even shorter pattern in my lookup_file, a single pass with this loop approach takes days or weeks.

Solutions in any language would be appreciated.

Somewhat related to
Very slow loop using grep or fgrep on large datasets
Is Perl faster than bash?

The solution needs to handle lookup and pattern files that are gigabytes in size.

asked Dec 03 '22 by Etienne Low-Décarie

2 Answers

You can use the -f switch to use a "pattern file" in grep:

egrep -i -f lookup_file pattern_file >> result_file

This will be faster because grep compiles lookup_file into a single state machine that checks all matches at the same time, rather than checking each pattern against each line separately.

If your lookup_file consists of text and not regular expressions, you can use fgrep and it will be even faster.

To get your ideal output you can add the -n and -o switches: -o prints each match on its own line and -n prefixes it with the number of the line it was found on, so you get a list of the patterns that match each line.
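As a minimal sketch of how -n and -o combine into the requested CSV (GNU grep assumed; file names and contents as in the question; the sed lookup of the full sentence by line number is one way to do it, not the only one):

```shell
# Work in a scratch directory with the question's example files.
cd "$(mktemp -d)"
printf '%s\n' 'The sun is shining!' 'It is a beautiful day!' > pattern_file
printf '%s\n' 'Rain' 'Sun' 'Cloud' 'Beautiful' 'Shining' > lookup_file

# -o prints each match, -n prefixes the pattern_file line number it
# came from; sed then looks the full sentence back up by that number.
result=$(grep -i -o -n -f lookup_file pattern_file |
  while IFS=: read -r lineno word; do
    printf '"%s","%s"\n' "$word" "$(sed -n "${lineno}p" pattern_file)"
  done)
printf '%s\n' "$result"
```

Note that -o prints the matched text as it appears in the input, so with -i the words come out in the pattern_file's casing ("sun", not "Sun").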

answered Dec 26 '22 by Joni


Since you indicated any language is acceptable, I will post a completely different approach: with shell scripting you will never beat the performance of in-memory tools or databases. If you have a lot of data you should use a database, which is meant for this kind of operation and scales much better.

So here is a simple example using SQLite (www.sqlite.org).

You need to import your patterns and data into tables, like this for example (you can script this if you want):

CREATE TABLE patterns (pattern TEXT);
CREATE TABLE data (sentence TEXT);

BEGIN;

INSERT INTO patterns VALUES ('Sun');
INSERT INTO patterns VALUES ('Rain');
INSERT INTO patterns VALUES ('Cloud');
INSERT INTO patterns VALUES ('Beautiful');

INSERT INTO data VALUES ('The sun is shining');
INSERT INTO data VALUES ('It is a beautiful day');
INSERT INTO data VALUES ('It is cloudy and the sun shines');

COMMIT;

Then run a select query to get your desired output:

select pattern, group_concat(sentence) as doesmatch from (
    select pattern, sentence, lower(pattern) as lpattern, lower(sentence) as lsentence
    from patterns left outer join data
    where like('%' || lpattern || '%', lsentence)
) group by pattern;

If you save the first snippet as data.sql and the second one as query.sql, you can use this on the command line:

sqlite3 sentences.db < data.sql    # this imports your data, run once
sqlite3 sentences.db < query.sql

This gives you:

Beautiful|It is a beautiful day
Cloud|It is cloudy and the sun shines
Sun|The sun is shining,It is cloudy and the sun shines

which is what you want, I believe. To make it fancier, use your favourite language with a database library; I would choose Python for this.
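The two-file flow above can also be collapsed into a single invocation; a sketch assuming the sqlite3 CLI is on your PATH, using an in-memory database and a cut-down data set so nothing is left on disk:

```shell
# Create, populate, and query in one heredoc; :memory: keeps it all in RAM.
result=$(sqlite3 :memory: <<'SQL'
CREATE TABLE patterns (pattern TEXT);
CREATE TABLE data (sentence TEXT);
INSERT INTO patterns VALUES ('Sun'), ('Beautiful');
INSERT INTO data VALUES ('The sun is shining'), ('It is a beautiful day');
SELECT pattern, sentence
FROM patterns JOIN data
ON like('%' || lower(pattern) || '%', lower(sentence));
SQL
)
printf '%s\n' "$result"
```

This prints one `pattern|sentence` row per case-insensitive substring match, the same join condition as the query above minus the group_concat aggregation.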

Suggestions for further improvement:

  • use regex instead of like to filter whole words (i.e. pattern "sun" matches "sun" but not "sunny"),

  • import utility,

  • output formatting,

  • query optimization.

answered Dec 26 '22 by Martijn de Milliano