I have two lists. I need to determine which word from the first list appears most frequently in the second list. The first, list1.txt
contains a list of words, sorted alphabetically, with no duplicates. I have used some scripts which ensures that each word appears on a unique line, e.g.:
canyon
fish
forest
mountain
river
The second file, list2.txt
is in UTF-8 and also contains many items. I have also used some scripts to ensure that each word appears on a unique line, but some items are not words, and some might appear many times, e.g.:
fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean
list1.txt
most often occurs in list2.txt
.Here is what I have so far. First, it searches for each word and creates a CSV file with the matches:
#!/bin/bash
while read -r line
do
count=$(grep -c ^$line list2.txt)
echo $line”,”$count >> found.csv
done < ./list1.txt
After that, found.csv
is sorted descending by the second column. The output is the word appearing on the first line.
I do not think though, that this is a good script, because it is not so efficient, and it is possible that there might not be a most frequent matching item, for e.g.:
list1.txt
appears in list2.txt
, then the output is simply the first word from the file list1.txt
, e.g. “canyon”.How can I create a more efficient script which finds which word from the first list appears most often in the second?
You can use the following pipeline:
grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1
F
tells grep to search literal words, f
tells it to use list1.txt
as the list of words to search for. The rest sorts the matches, counts duplicates, and sorts them according to the number of occurrences. The last part selects the last line, i.e. the most common one (plus the number of occurrences).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With