
Count the number of matched terms in a text file?

I am trying to count, for each term in an input list (one term per line), the number of matching lines in a data file, and to create an output file containing each matched (grep'd) term together with its number of matches.

The input_list.txt looks like this:

+ 5S_rRNA
+ 7SK
+ AADAC
+ AC000111.3
+ AC000111.6

The data.txt file:

chr10   101780038   101780209   5S_rRNA
chr10   103578280   103578430   5S_rRNA
chr10   112327234   112327297   5S_rRNA
chr10   120766459   120766601   7SK
chr10   127408228   127408317   7SK
chr10   127511874   127512063   AADAC
chr10   14614140    14614294    AC000111.3
chr10   14695964    14696146    AC000111.6

I would like to create an output file (output.txt) containing the matched terms with their corresponding count.

+ 5S_rRNA   3
+ 7SK   2
+ AADAC 1
+ AC000111.3    1
+ AC000111.6    1

So far, I've produced a list containing all the matched terms using the following script, but all my attempts to also produce a count of the matches haven't worked.

    exec < input_list.txt
    while read line
    do
        # list each matched term once
        grep -w "$line" data.txt | sort | uniq >> grep_output.txt
    done

I have tried grep -o -w | wc -l and grep -w data.txt | wc -l, etc., but I can't work out how to produce an output list containing each matched term with its corresponding count.

Any suggestions would be great!

asked Dec 03 '13 by user1879573

3 Answers

You can grep the terms from input_list.txt and use uniq -c to get the counts:

cut -d' ' -f2 input_list.txt | grep -o -f - data.txt | sort | uniq -c

Gives:

  3 5S_rRNA
  2 7SK
  1 AADAC
  1 AC000111.3
  1 AC000111.6
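
Note that grep -o -f - treats each term as a plain pattern, so a short term such as 7SK would also be counted inside any longer name that happens to contain it. If that is a risk in your data, adding -w (assuming GNU grep) restricts matches to whole words:

cut -d' ' -f2 input_list.txt | grep -ow -f - data.txt | sort | uniq -c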

You can also pipe through sed to get the formatted output requested in the question:

cut -d' ' -f2 input_list.txt | grep -o -f - data.txt | sort | uniq -c | \
      sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2\t\1/'

Produces:

+ 5S_rRNA   3
+ 7SK   2
+ AADAC 1
+ AC000111.3    1
+ AC000111.6    1
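
To write the result straight into output.txt, as asked in the question, redirect the same pipeline:

cut -d' ' -f2 input_list.txt | grep -o -f - data.txt | sort | uniq -c | \
      sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2\t\1/' > output.txt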
answered Oct 06 '22 by perreal


awk can be good for this:

$ awk 'NR==FNR {vals[$2]=$2}
       $4 in vals {count[$4]++}
       END {for (i in count) print i, count[i]}' input_list.txt data.txt
AC000111.3 1
AC000111.6 1
5S_rRNA 3
AADAC 1
7SK 2

Explanation

vals[] stores the second field of each input_list.txt line, i.e. the term itself. Then, for every line of the second file, data.txt, it checks whether the 4th field is one of those terms and counts the occurrences in the count[] array. Finally, it prints the result in the END{} block.
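
For completeness, a minimal variant of the same idea (same two files assumed) that prints the "+ term count" layout from the question and writes it to output.txt; the added next stops input_list.txt lines from being tested against the second pattern:

$ awk 'NR==FNR {vals[$2]; next}
       $4 in vals {count[$4]++}
       END {for (i in count) printf "+ %s\t%d\n", i, count[i]}' input_list.txt data.txt > output.txt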

Piping through sort with the n (numeric), r (reverse) and k2 (sort on the 2nd column) options gives the output ordered by count:

$ awk 'NR==FNR {vals[$2]=$2}
       $4 in vals {count[$4]++}
       END {for (i in count) print i, count[i]}' input_list.txt data.txt | sort -rnk2
5S_rRNA 3
7SK 2
AC000111.6 1
AC000111.3 1
AADAC 1
answered Oct 05 '22 by fedorqui 'SO stop harming'


A Perl one-liner, using autosplit (-a) so the term is available as $F[3], does the same:

perl -lane '$s{ $F[3] }++; END{ print "+ $_ $s{$_}" for sort keys %s }' data.txt
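
As written, this counts every value in column 4 of data.txt, whether or not it appears in input_list.txt; with the sample data the two coincide. A sketch that restricts the count to the listed terms, using the fact that @ARGV is non-empty only while the first file is being read:

perl -lane 'if (@ARGV) { $want{ $F[1] }++ } else { $s{ $F[3] }++ if $want{ $F[3] } }
            END { print "+ $_\t$s{$_}" for sort keys %s }' input_list.txt data.txt > output.txt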
answered Oct 06 '22 by mpapec