grep finds a short pattern from a pattern file in the long lines of a look-up file; I need the reverse: a tool that extracts the short lines of a lookup file that can be found within a longer pattern line.
In other words, given the works of Shakespeare with one sentence per line and, say, a French dictionary, I want to find which French words appear in which lines of Shakespeare, bearing in mind that a line of Shakespeare may contain more than one French word and that a French word may appear in more than one line of Shakespeare.
For example:
pattern_file={
"The sun is shining!"
"It is a beautiful day!"}
lookup_file={
"Rain"
"Sun"
"Cloud"
"Beautiful"
"Shining"}
What I would like is
function file pattern
to give both the line that is found in the longer pattern and the longer pattern itself, separated by a comma, with multiple matches being detected.
ideal_result_file={
"Sun","The sun is shining!"
"Beautiful","It is a beautiful day!",
"Shining", "The sun is shining!"}
Currently, I loop over the whole lookup file line by line with grep:
while read -r line
do
    # prepend the lookup word and a comma to every matching pattern line
    grep -is "$line" pattern_file | sed "s/^/$line,/" >> result_file.csv
done < lookup_file
This is incredibly slow! My lookup_file contains over 50,000 lines, while my pattern_file contains 500. Whereas using grep to find an even shorter pattern in my lookup_file takes seconds, a single pass with my loop approach takes days or weeks.
Solutions in any language would be appreciated.
Somewhat related to
Very slow loop using grep or fgrep on large datasets
Is Perl faster than bash?
The solution needs to be compatible with GB-sized lookup and pattern files.
You can use the -f switch to use a "pattern file" in grep:
egrep -i -f lookup_file pattern_file >> result_file
This will be faster because grep compiles lookup_file into a single state machine that checks all matches at the same time, rather than checking each pattern against each line separately.
If your lookup_file consists of text and not regular expressions, you can use fgrep and it will be even faster.
To get closer to your ideal output, you can add the -n and -o switches: for every match you get the matched pattern (-o), prefixed by the number of the line it was found on (-n).
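As a minimal sketch of turning this into the pattern,line pairs of ideal_result_file: loop over the small pattern_file (500 lines) instead of the huge lookup_file and let a single grep -i -o -F -f call report every lookup word found in each line. It assumes the lookup entries are plain text and that the sentences contain no embedded double quotes:

while IFS= read -r sentence; do
    # print every lookup word that occurs in this sentence, one per line
    printf '%s\n' "$sentence" \
        | grep -i -o -F -f lookup_file \
        | sort -u \
        | while IFS= read -r word; do
              # emit a CSV row: matched word, then the full sentence
              printf '"%s","%s"\n' "$word" "$sentence"
          done
done < pattern_file > result_file.csv

Note that the matched word is printed as it appears in the sentence (e.g. "sun" rather than "Sun").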
Since you indicated that any language is acceptable, I will post a completely different approach: with shell scripting you will never beat the performance of in-memory tools or databases. If you have a lot of data, you should use a database, which is built for this kind of operation and scales much better.
So here is a simple example using sqlite (www.sqlite.org).
You need to import your patterns and data into tables, like this for example (you can script this if you want; see the sketch after this snippet):
CREATE TABLE patterns (pattern TEXT);
CREATE TABLE data (sentence TEXT);
BEGIN;
INSERT INTO patterns VALUES ('Sun');
INSERT INTO patterns VALUES ('Rain');
INSERT INTO patterns VALUES ('Cloud');
INSERT INTO patterns VALUES ('Beautiful');
INSERT INTO data VALUES ('The sun is shining');
INSERT INTO data VALUES ('It is a beautiful day');
INSERT INTO data VALUES ('It is cloudy and the sun shines');
COMMIT;
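As mentioned above, generating this import script can itself be scripted. A minimal sketch, assuming the lookup_file and pattern_file from the question and that single quotes are the only characters that need escaping in the generated SQL:

# generate data.sql from the two text files;
# single quotes in the input are doubled so the generated SQL stays valid
{
    echo "CREATE TABLE patterns (pattern TEXT);"
    echo "CREATE TABLE data (sentence TEXT);"
    echo "BEGIN;"
    sed "s/'/''/g; s/.*/INSERT INTO patterns VALUES ('&');/" lookup_file
    sed "s/'/''/g; s/.*/INSERT INTO data VALUES ('&');/" pattern_file
    echo "COMMIT;"
} > data.sql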
Then run a select query to get your desired output:
select pattern, group_concat(sentence) as doesmatch from (
select pattern, sentence, lower(pattern) as lpattern, lower(sentence) as lsentence
from patterns left outer join data
where like('%' || lpattern || '%', lsentence)
) group by pattern;
If you save the first snippet as data.sql and the second one as query.sql, you can use this on the command line:
sqlite3 sentences.db < data.sql # this imports your data, run once
sqlite3 sentences.db < query.sql
This gives you:
Beautiful|It is a beautiful day
Cloud|It is cloudy and the sun shines
Sun|The sun is shining,It is cloudy and the sun shines
which is what you want, I believe. To make it fancier, use your favourite more advanced tool with a database library; I would choose Python for this.
Suggestions for further improvement:
- use regex instead of like to match whole words only (i.e. the pattern "sun" matches "sun" but not "sunny"),
- an import utility,
- output formatting,
- query optimization.