 

How to find which line from first file appears most frequently in second file?

Tags: bash, frequency

I have two lists and need to determine which word from the first list appears most frequently in the second. The first, list1.txt, contains a list of words, sorted alphabetically, with no duplicates. I have used some scripts to ensure that each word appears on its own line, e.g.:

canyon
fish
forest
mountain
river

The second file, list2.txt, is in UTF-8 and also contains many items. I have also used some scripts to ensure that each item appears on its own line, but some items are not words, and some might appear many times, e.g.:

fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean

  • The script should output the most frequently matching item. For example, if run with the two files above, the output would be “fish”, because that is the word from list1.txt that occurs most often in list2.txt.

Here is what I have so far. First, it searches for each word and creates a CSV file with the matches:

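#!/bin/bash
# For each word in list1.txt, count the lines of list2.txt that are exactly that word.
while IFS= read -r line
do
    # -x: whole-line match; -F: fixed string, not a regex; --: end of options
    count=$(grep -c -x -F -- "$line" list2.txt)
    echo "$line,$count"
done < ./list1.txt > found.csv   # overwrite found.csv rather than appending on every run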

After that, found.csv is sorted in descending order by the second column, and the output is the word on the first line. I do not think this is a good script, though: it is not very efficient (it runs grep over list2.txt once for every word in list1.txt), and it does not handle the cases where there is no single most frequent matching item, e.g.:

  • If there is a tie between two or more words, e.g. “fish”, “canyon”, and “forest” each appear 5 times while no other word appears as often, the output should be these three words in alphabetical order, separated by commas, e.g. “canyon,fish,forest”.
  • If none of the words from list1.txt appears in list2.txt, the output should simply be the first word from list1.txt, e.g. “canyon”.
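
The sorting step mentioned above is not shown in the script. A minimal sketch of how it might look, assuming found.csv holds one word,count pair per line; the function name top_word is made up for illustration, and this sketch does not handle the tie or no-match cases:

```shell
#!/bin/bash
# Hypothetical helper: print the word with the highest count
# from a CSV file whose lines look like "word,count".
top_word() {
    # -t,      use a comma as the field separator
    # -k2,2nr  sort on the second field only, numerically, in descending order
    sort -t, -k2,2nr "$1" | head -n 1 | cut -d, -f1
}
```

It would be called as: top_word found.csv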

How can I create a more efficient script which finds which word from the first list appears most often in the second?

Village asked Dec 26 '22 18:12



1 Answer

You can use the following pipeline:

grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1

-F tells grep to treat the patterns as fixed strings rather than regular expressions, and -f tells it to read the patterns from list1.txt. The rest of the pipeline sorts the matches, counts the duplicates, and sorts the result by number of occurrences. The final tail selects the last line, i.e. the most common match, preceded by its number of occurrences.
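
The pipeline prints the count along with the word and does not cover the tie or no-match cases described in the question. One possible sketch that does, using a single awk pass instead of grep, is below. It assumes that list1.txt is already sorted alphabetically (as the question states) and that only exact whole-line matches should count. The function name most_frequent is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper. $1: word list (one per line, sorted, no duplicates);
# $2: file whose lines are counted.
most_frequent() {
    awk '
        # First file: remember each word and its position in the list.
        # (NR == FNR assumes the first file is not empty.)
        NR == FNR { words[++n] = $0; known[$0] = 1; next }

        # Second file: count lines that are exactly one of the known words.
        $0 in known { count[$0]++ }

        END {
            # Find the highest count among the listed words.
            max = 0
            for (i = 1; i <= n; i++)
                if (count[words[i]] > max) max = count[words[i]]

            # No word matched at all: fall back to the first word of the list.
            if (max == 0) { print words[1]; exit }

            # Join every word that reaches the maximum, in list order,
            # which is alphabetical if the first file is sorted.
            out = ""
            for (i = 1; i <= n; i++)
                if (count[words[i]] == max)
                    out = (out == "" ? words[i] : out "," words[i])
            print out
        }
    ' "$1" "$2"
}
```

Called as most_frequent list1.txt list2.txt, this prints fish for the sample files in the question. Each file is read only once, so the cost no longer grows with one grep run per word.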

choroba answered Apr 09 '23 23:04
