
Count occurrences of a list of words in a text file

Tags: bash

I have two text files. File1 looks like this:

apple
dog
cat
..
..

and File2 looks like this:

appledogtree 
dog
catapple
apple00001
..
..

I want to count the occurrences of each word listed in File1 within File2, and get a result like the one below:

(words in File1, number of occurrences in File2)

apple 3
dog 2
cat 1

How can I do this from the Bash command line?

Alibuda asked Jan 30 '17 04:01


3 Answers

You can use fgrep (equivalent to grep -F, which treats each pattern as a fixed string) to do this efficiently:

fgrep -of f1.txt f2.txt | sort | uniq -c | awk '{print $2 " " $1}'

Gives this output:

apple 3
cat 1
dog 2

  • fgrep -of f1.txt f2.txt extracts all the matching parts (-o option) of f2.txt based on the patterns in f1.txt
  • sort | uniq -c counts the matching patterns
  • finally, awk swaps the two columns of the uniq -c output so that the word comes before its count
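
The sorted output above does not follow the order of the word list, and a word that never matches is left out. As a minimal sketch (the function name count_in_order is hypothetical, not part of the original answer), the same grep -oF extraction can feed an awk script that walks the word list and prints a count for every entry, including 0. It assumes one word per line and relies on grep -o reporting non-overlapping matches.

```bash
# Hypothetical helper: count_in_order WORDS_FILE TEXT_FILE
# Prints "word count" for every line of WORDS_FILE, in that file's order.
count_in_order() {
    # grep -oF prints each fixed-string match on its own line.
    # awk reads the word list first (to remember the order), then tallies
    # the matches arriving on stdin ("-"), and prints the totals at the end.
    grep -oF -f "$1" -- "$2" |
        awk 'NR == FNR { order[++k] = $0; next }
             { n[$0]++ }
             END { for (i = 1; i <= k; i++) print order[i], n[order[i]] + 0 }' "$1" -
}
```

Usage: count_in_order f1.txt f2.txt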
codeforester answered Oct 06 '22 05:10


Given:

$ cat f1.txt
apple
dog
cat
$ cat f2.txt
appledogtree 
dog
catapple
apple00001

Try:

while IFS= read -r line || [[ -n $line ]]; do
    # grep -c counts matching lines, so a word is counted at most once per line
    printf "%s->%s\n" "$line" "$(grep -c -- "$line" f2.txt)"
done <f1.txt

Prints:

apple->3
dog->2
cat->1
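
Note that grep -c reports the number of matching lines, not the number of matches, so a line such as "appleapple" would add only 1 to the count for apple. A possible variant that counts every occurrence is sketched below; the function name count_each is hypothetical, and the words are treated as fixed strings (-F) rather than regular expressions.

```bash
# Hypothetical helper: count_each WORDS_FILE TEXT_FILE
# Prints "word->count", counting every occurrence rather than every line.
count_each() {
    while IFS= read -r word || [ -n "$word" ]; do
        [ -n "$word" ] || continue      # skip blank lines in the word list
        # -o puts each match on its own line, so wc -l counts occurrences;
        # $(( )) strips the padding some wc implementations add
        n=$(( $(grep -oF -- "$word" "$2" | wc -l) ))
        printf '%s->%s\n' "$word" "$n"
    done < "$1"
}
```

Usage: count_each f1.txt f2.txt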

If you want a pipeline, you can do:

cat f1.txt | xargs | sed -e 's/ /\|/g' | grep -Eof /dev/stdin f2.txt | awk '{a[$1]++} END{for (x in a) print x, a[x]}'

Which does:

  1. cat f1.txt writes the contents of the file to the pipe;
  2. xargs joins those lines into one space-separated line;
  3. sed -e 's/ /\|/g' replaces the spaces to form the pattern "apple|dog|cat";
  4. grep -Eof /dev/stdin f2.txt reads that pattern from stdin and prints every match found in f2.txt, one per line;
  5. awk '{a[$1]++} END{for (x in a) print x, a[x]}' tallies the matched words and prints each word with its count.

With GNU grep, you can shorten this to grep -Eof - f2.txt, where - stands for stdin.

That pipeline works on POSIX and Linux...
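
As an alternative sketch of the first three steps, paste -s can build the same alternation in a single command. The function name count_alternation is hypothetical. The sketch assumes the word list has no blank lines and no characters that are special in extended regular expressions, since every word becomes part of one regex.

```bash
# Hypothetical helper: count_alternation WORDS_FILE TEXT_FILE
# Prints "word count" sorted by word.
count_alternation() {
    # paste -s joins all lines of the word list with "|", e.g. apple|dog|cat
    pattern=$(paste -sd '|' "$1")
    # grep -oE prints each match on its own line; awk tallies them
    grep -oE -e "$pattern" -- "$2" |
        awk '{ a[$0]++ } END { for (w in a) print w, a[w] }' |
        sort
}
```

Usage: count_alternation f1.txt f2.txt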


If you want pure efficiency, just use awk:

awk 'NR==FNR {pat[FNR]=$1; next}                                   # load the words from f1.txt
             {for (i in pat) if (match($0, pat[i])) m[pat[i]]++}   # count the lines of f2.txt that match each word
             END {for (e in m) print e, m[e]}' f1.txt f2.txt
dawg answered Oct 06 '22 04:10


In awk:

$ awk 'NR==FNR { a[$1]; next }                  # read in all search words
               { for(i in a) a[i]+=gsub(i,i) }  # count matches of all keywords in record
            END{ for(i in a) print i,a[i] }     # output results
' file1 file2
apple 3
cat 1
dog 2
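
Because gsub treats each keyword as a regular expression, a keyword such as "a.c" would also match "abc". If the keywords should be taken literally, one possible variant uses index() and substr() instead. This is a sketch only: the function name count_literal is hypothetical, and it assumes a non-empty word list with one keyword per line.

```bash
# Hypothetical helper: count_literal WORDS_FILE TEXT_FILE
# Counts non-overlapping literal occurrences of each keyword.
count_literal() {
    awk 'NR == FNR { if ($0 != "") words[$0] = 0; next }   # first file: the word list
         {
             for (w in words) {
                 rest = $0
                 # index() searches for literal text, so "." and "*" are not special
                 while ((pos = index(rest, w)) > 0) {
                     words[w]++
                     rest = substr(rest, pos + length(w))   # continue after this match
                 }
             }
         }
         END { for (w in words) print w, words[w] }' "$1" "$2" | sort
}
```

Usage: count_literal file1 file2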
James Brown answered Oct 06 '22 04:10