Count the number of matched terms in a text file?

Question

I am trying to count how many times each term from an input list (one term per line) occurs in a data file, and to create an output file containing each matched (grep'd) term with its number of matches.

The input_list.txt looks like this:

+ 5S_rRNA
+ 7SK
+ AADAC
+ AC000111.3
+ AC000111.6

The data.txt file:

chr10   101780038   101780209   5S_rRNA
chr10   103578280   103578430   5S_rRNA
chr10   112327234   112327297   5S_rRNA
chr10   120766459   120766601   7SK
chr10   127408228   127408317   7SK
chr10   127511874   127512063   AADAC
chr10   14614140    14614294    AC000111.3
chr10   14695964    14696146    AC000111.6

I would like to create an output file (output.txt) containing the matched terms with their corresponding count.

+ 5S_rRNA   3
+ 7SK   2
+ AADAC 1
+ AC000111.3    1
+ AC000111.6    1

So far, I've produced a list containing all the matched terms using the following script but all attempts to provide a count of the matched terms haven't worked.

    exec < input_list.txt
    while read line
    do
        grep -w "$line" data.txt | awk '{print $0}' | sort | uniq >> grep_output.txt
    done

I have tried grep -o -w | wc -l and grep -w data.txt | wc -l etc but I can't work out how to produce an output list containing the matched term with its corresponding count.

Any suggestions would be great!

perreal · Accepted Answer

You can grep the terms from input_list.txt in data.txt and use uniq -c to get the counts:

cut -d' ' -f2 input_list.txt | grep -o -f - data.txt | sort | uniq -c

Gives:

  3 5S_rRNA
  2 7SK
  1 AADAC
  1 AC000111.3
  1 AC000111.6
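
One caveat with the pipeline above: grep treats each term as a regular expression, so the "." in AC000111.3 matches any character, and a term can also match inside a longer word. The following is a minimal sketch, assuming GNU grep and the sample files from the question (recreated here in a scratch directory that is only an illustration), which adds -F for literal matching and -w for whole-word matching:

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '+ %s\n' 5S_rRNA 7SK AADAC AC000111.3 AC000111.6 > "$workdir/input_list.txt"
cat > "$workdir/data.txt" <<'EOF'
chr10   101780038   101780209   5S_rRNA
chr10   103578280   103578430   5S_rRNA
chr10   112327234   112327297   5S_rRNA
chr10   120766459   120766601   7SK
chr10   127408228   127408317   7SK
chr10   127511874   127512063   AADAC
chr10   14614140    14614294    AC000111.3
chr10   14695964    14696146    AC000111.6
EOF

# -F: treat the terms as literal strings, not regular expressions
# -w: only accept whole-word matches
# -o: print just the matched text; -f -: read the terms from stdin
counts=$(cut -d' ' -f2 "$workdir/input_list.txt" |
         grep -o -w -F -f - "$workdir/data.txt" |
         sort | uniq -c)
printf '%s\n' "$counts"

rm -rf "$workdir"
```

For the sample data the counts are the same as above; the difference should only show up when a term contains regex metacharacters or is a substring of another word.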

You can also append a sed command to get the requested output format:

cut -d' ' -f2 input_list.txt | grep -o -f - data.txt | sort | uniq -c | \
      sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2\t\1/'

Produces:

+ 5S_rRNA   3
+ 7SK   2
+ AADAC 1
+ AC000111.3    1
+ AC000111.6    1
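
The loop from the question can also be made to work by asking grep for a count per term. This is only a sketch under the assumption that each line of input_list.txt has the form "+ term" as shown in the question; the scratch directory and the variable names are illustrative. Note that grep -c counts matching lines rather than individual occurrences, and that running one grep per term is slower than the single pipeline above for long lists.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '+ %s\n' 5S_rRNA 7SK AADAC AC000111.3 AC000111.6 > "$workdir/input_list.txt"
cat > "$workdir/data.txt" <<'EOF'
chr10   101780038   101780209   5S_rRNA
chr10   103578280   103578430   5S_rRNA
chr10   112327234   112327297   5S_rRNA
chr10   120766459   120766601   7SK
chr10   127408228   127408317   7SK
chr10   127511874   127512063   AADAC
chr10   14614140    14614294    AC000111.3
chr10   14695964    14696146    AC000111.6
EOF

# Read each "+ term" line: the "+" goes into the throwaway variable _,
# the term itself into $term.
while read -r _ term; do
    # -c: count matching lines, -w: whole words, -F: literal string
    n=$(grep -c -w -F -- "$term" "$workdir/data.txt")
    printf '+ %s\t%s\n' "$term" "$n"
done < "$workdir/input_list.txt" > "$workdir/output.txt"

output=$(cat "$workdir/output.txt")
printf '%s\n' "$output"

rm -rf "$workdir"
```

Unlike the grep -o pipeline, this keeps the order of the input list and prints a count of 0 for terms that never occur.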

fedorqui 'SO stop harming' · Answer

awk can be good for this:

$ awk 'NR==FNR {vals[$2]=$2}
       $4 in vals {count[$4]++}
       END {for (i in count) print i, count[i]}' input_list.txt data.txt
AC000111.3 1
AC000111.6 1
5S_rRNA 3
AADAC 1
7SK 2

Explanation

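While reading the first file (NR==FNR), vals[] stores the second field of each line of the input list. Then, for each line of the second file, data.txt, awk checks whether the 4th field is a key in vals[] and, if so, increments that term's entry in the count[] array. Finally, the END{} block prints each term with its count. The order of the output lines is not guaranteed, because for (i in count) visits the array keys in an unspecified order.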

Piping the output to sort with the -n (numeric), -r (reverse) and -k2 (key starting at the 2nd column) options sorts it by count, highest first:

$ awk 'NR==FNR {vals[$2]=$2}
       $4 in vals {count[$4]++}
       END {for (i in count) print i, count[i]}' input_list.txt data.txt | sort -rnk2
5S_rRNA 3
7SK 2
AC000111.6 1
AC000111.3 1
AADAC 1
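
If the output should follow the order of the input list and use the "+ term<TAB>count" layout from the question, the awk approach can be adapted as sketched below. This is one possible variant, not part of the original answer: the order[] array and the counter n are names introduced here for illustration, and the sample files are recreated in a scratch directory.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '+ %s\n' 5S_rRNA 7SK AADAC AC000111.3 AC000111.6 > "$workdir/input_list.txt"
cat > "$workdir/data.txt" <<'EOF'
chr10   101780038   101780209   5S_rRNA
chr10   103578280   103578430   5S_rRNA
chr10   112327234   112327297   5S_rRNA
chr10   120766459   120766601   7SK
chr10   127408228   127408317   7SK
chr10   127511874   127512063   AADAC
chr10   14614140    14614294    AC000111.3
chr10   14695964    14696146    AC000111.6
EOF

result=$(awk '
    # First file: remember the terms in the order they are listed.
    NR == FNR { order[++n] = $2; next }
    # Second file: tally every value seen in the 4th column.
    { count[$4]++ }
    # Print only the listed terms, in list order; %d turns a missing tally into 0.
    END {
        for (i = 1; i <= n; i++)
            printf "+ %s\t%d\n", order[i], count[order[i]]
    }' "$workdir/input_list.txt" "$workdir/data.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Indexing the terms by position avoids the unspecified iteration order of for (i in count), so no extra sort step is needed, and terms from the list that never appear in data.txt are reported with a count of 0.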

mpapec · Answer

perl -lane '$s{ $F[3] }++ END{ print "+ $_ $s{$_}" for sort keys %s }' data.txt
