Shell: filter prohibited words out of a file

Good day, shell lovers!

Basically, I have two files:

frequency.txt: (a multi-line, space-separated file containing one word and its frequency per line)

de 1711
a 936
et 762
la 530
les 482
pour 439
le 425
...

And I have a file containing "prohibited" words:

stopwords.txt: (a single-line, space-separated file)

 au aux avec le ces dans ...

So I want to delete from frequency.txt every line containing a word found in stopwords.txt.

How could I do that? I'm thinking it could be done with awk, something like:

awk 'match($0,SOMETHING_MAGICAL_HERE) == 0 {print $0}' frequency.txt > new.txt

But I'm not really sure. Any ideas? Thanks in advance.

pleasedontbelong asked Dec 03 '22 10:12


2 Answers

$ awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop.txt freq.txt
de 1711
a 936
et 762
la 530
les 482
pour 439
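
For readers who want to see what the one-liner is doing, here is an expanded, commented sketch of the same idea. The sample data is copied from the question; the temporary directory, the `workdir` variable and the array name `stop` are just illustrative choices, not part of the original answer.

```shell
# Sample data from the question, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'de 1711' 'a 936' 'et 762' 'la 530' 'les 482' 'pour 439' 'le 425' \
    > "$workdir/frequency.txt"
printf '%s\n' ' au aux avec le ces dans' > "$workdir/stopwords.txt"

awk '
    # First file (stopwords): FNR == NR only holds while reading it.
    FNR == NR {
        for (i = 1; i <= NF; i++)   # every space-separated word on the line
            stop[$i]                # referencing the key creates it
        next                        # do not fall through to the filter rule
    }
    # Second file (frequencies): print lines whose first field is not a stopword.
    !($1 in stop)
' "$workdir/stopwords.txt" "$workdir/frequency.txt" > "$workdir/new.txt"

cat "$workdir/new.txt"
```

With this sample only the `le 425` line is dropped. One caveat: if the stopwords file is completely empty, `FNR == NR` stays true for the second file as well, so nothing gets printed.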
ghostdog74 answered Dec 11 '22 15:12


This will do it for you:

tr ' ' '\n' <stopwords.txt | grep -v -w -F -f - frequency.txt

-v inverts the match, so only lines that match no stopword are printed
-w matches whole words only
-F treats each pattern as a fixed string rather than a regular expression
-f - reads the newline-separated patterns from standard input

Because stopwords.txt is a single space-delimited line, tr is used first to replace the spaces with newlines, giving grep one pattern per line.
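
The stopwords.txt sample in the question starts with a space, so a plain `tr ' ' '\n'` also emits an empty line, and how grep treats an empty pattern is not something I would rely on. Below is a slightly more defensive sketch of the same pipeline, assuming a grep that accepts `-f -` for standard input (GNU grep does). The scratch directory and the `workdir` variable are only there to make the example self-contained.

```shell
# Sample data from the question, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'de 1711' 'a 936' 'et 762' 'la 530' 'les 482' 'pour 439' 'le 425' \
    > "$workdir/frequency.txt"
printf '%s\n' ' au aux avec le ces dans' > "$workdir/stopwords.txt"

# tr -s squeezes each run of spaces into a single newline;
# sed then drops the empty line left behind by the leading space.
tr -s ' ' '\n' < "$workdir/stopwords.txt" \
    | sed '/^$/d' \
    | grep -v -w -F -f - "$workdir/frequency.txt" > "$workdir/new.txt"

cat "$workdir/new.txt"
```

Keep in mind that -w looks for the stopword anywhere on the line, while the awk approach compares only the first field, so the two can differ on less tidy input.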

Michael Goldshteyn answered Dec 11 '22 17:12