I have a file called words.txt containing a list of words. I also have a file called file.txt containing one sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" is found on both of them and, in each case, it falls between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes: each iteration runs sed once and rewrites file.txt in full. I attempted to use the sed -f option, which seems to allow reading commands from a file, but I cannot find any manuals explaining how to use it.
How can I improve the speed of the script?
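(For reference, sed -f file reads its editing commands from file, one per line. Below is a minimal sketch of that approach, assuming the words contain no characters special to sed or to regular expressions: it generates one delete command per word, then runs sed a single time. With GNU sed, -i can be added to edit file.txt in place.)

sed 's|.*|/{[^{}]*&[^{}]*}/d|' words.txt > delete.sed
sed -f delete.sed file.txt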
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It edits file.txt in place, producing the expected output:
Once upon a time there was a cat.
Uncondensed version:
awk '
# First pass (true only while reading words.txt): build one pattern per
# word that matches the word between { and } with no other braces between.
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
# Second pass over file.txt: skip lines matching any pattern,
# buffer the survivors.
{
    for (i in a)
        if ($0 ~ i)
            next
    b[j++] = $0
}
# Rewrite file.txt (FILENAME names the current input file) from the buffer.
END {
    printf "" > FILENAME
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME
}
' words.txt file.txt
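The NR == FNR test is the standard two-file idiom: FNR, the per-file record counter, resets to 1 at each new input file while NR keeps counting, so the condition holds only while words.txt is being read. One caveat worth knowing: the words are used as regular expressions, so a word containing characters such as . or * may match more than intended.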
If the files are expected to get so large that awk cannot buffer the result in memory, we can only redirect the output to stdout; we cannot modify the file in place:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
# "1" is an always-true pattern, so awk prints every line not skipped above.
1
' words.txt file.txt
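To replace the original file with the filtered result, redirect to a temporary file and move it over. A sketch, assuming the script above is saved as filter.awk (the file names here are illustrative):

awk -f filter.awk words.txt file.txt > file.txt.new && mv file.txt.new file.txt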