I am trying to count how many times each term from an input list (one term per line) occurs in a data file, and to create an output file containing each matched (grep'd) term with its number of matches.
The input_list.txt looks like this:
+ 5S_rRNA
+ 7SK
+ AADAC
+ AC000111.3
+ AC000111.6
The data.txt file:
chr10 101780038 101780209 5S_rRNA
chr10 103578280 103578430 5S_rRNA
chr10 112327234 112327297 5S_rRNA
chr10 120766459 120766601 7SK
chr10 127408228 127408317 7SK
chr10 127511874 127512063 AADAC
chr10 14614140 14614294 AC000111.3
chr10 14695964 14696146 AC000111.6
I would like to create an output file (output.txt) containing the matched terms with their corresponding count.
+ 5S_rRNA 3
+ 7SK 2
+ AADAC 1
+ AC000111.3 1
+ AC000111.6 1
So far, I've produced a list containing all the matched terms using the following script but all attempts to provide a count of the matched terms haven't worked.
exec < input_list.txt
while read line
do
grep -w "$line" data.txt | awk '{print $0}'| sort| uniq >> grep_output.txt
done
I have tried grep -o -w | wc -l and grep -w data.txt | wc -l etc., but I can't work out how to produce an output list containing the matched term with its corresponding count.
Any suggestions would be great!
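You can grep the terms listed in input_list.txt and use uniq -c to get the counts (-F treats each term as a fixed string rather than a regex, so the dot in AC000111.3 is literal, and -w matches whole words only):
cut -d' ' -f2 input_list.txt | grep -o -w -F -f - data.txt | sort | uniq -c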
Gives:
3 5S_rRNA
2 7SK
1 AADAC
1 AC000111.3
1 AC000111.6
You can also pipe the result through sed to get the requested format (\s and \t are GNU sed extensions):
cut -d' ' -f2 input_list.txt | grep -o -w -F -f - data.txt | sort | uniq -c | \
sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2\t\1/'
Produces:
+ 5S_rRNA 3
+ 7SK 2
+ AADAC 1
+ AC000111.3 1
+ AC000111.6 1
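If terms with zero matches should also appear in the output, the loop from the question can be adapted instead. The following is only a minimal sketch, not the original poster's script: count_terms is a hypothetical helper name, and it assumes every line of the list has the form "+ term" as shown in the question. It counts matching lines with grep -c, so a term that occurs twice on the same line would be counted once.

```shell
#!/bin/sh
# count_terms LIST DATA  (hypothetical helper, not from the original thread)
# Prints "+ term<TAB>count" for every term in LIST, including zero counts.
# Assumes each LIST line looks like "+ term": a marker, a space, then the term.
count_terms() {
    list=$1
    data=$2
    # "marker" receives the leading "+", "term" receives the rest of the line.
    while read -r marker term; do
        [ -n "$term" ] || continue    # skip blank or malformed lines
        # -F: literal string, -w: whole words only, -c: number of matching lines
        n=$(grep -c -w -F -e "$term" "$data")
        printf '+ %s\t%s\n' "$term" "$n"
    done < "$list"
}
```

A possible invocation would be count_terms input_list.txt data.txt > output.txt. Running grep once per term is slow for long lists; the single-pass pipelines above scale better.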
awk can be good for this:
$ awk 'NR==FNR {vals[$2]=$2; next}
$4 in vals {count[$4]++}
END {for (i in count) print i, count[i]}' input_list.txt data.txt
AC000111.3 1
AC000111.6 1
5S_rRNA 3
AADAC 1
7SK 2
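vals[] stores the second field of each line of the input list file. Then, for each line of the second file, data.txt, the script checks whether the 4th field is in vals[] and, if so, counts the occurrence in the count[] array. Finally, it prints the results in the END{} block. The order in which for (i in count) visits the keys is unspecified, so the lines may not come out in input order.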
Piping to sort with the n (numeric), r (reverse) and k2 (2nd column) options, you get sorted data:
$ awk 'NR==FNR {vals[$2]=$2; next}
$4 in vals {count[$4]++}
END {for (i in count) print i, count[i]}' input_list.txt data.txt | sort -rnk2
5S_rRNA 3
7SK 2
AC000111.6 1
AC000111.3 1
AADAC 1
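The awk approach can also be extended to keep the terms in input order, report zero counts, and print the "+ term count" format asked for in the question. This is a sketch under stated assumptions rather than part of the original answer: count_terms_awk is a hypothetical wrapper name, the list lines are assumed to look like "+ term", and the term is assumed to sit in the 4th whitespace-separated field of the data file.

```shell
#!/bin/sh
# count_terms_awk LIST DATA  (hypothetical wrapper, not from the original thread)
# Prints "+ term count" for each distinct term in LIST, in LIST order.
count_terms_awk() {
    awk '
        # First file: remember each term once, in the order it appears.
        NR == FNR {
            if ($2 != "" && !($2 in wanted)) {
                wanted[$2] = 1
                order[++n] = $2
            }
            next
        }
        # Second file: count lines whose 4th field is a wanted term.
        $4 in wanted { count[$4]++ }
        # Print in list order; an unseen term has an empty count, which %d prints as 0.
        END {
            for (i = 1; i <= n; i++)
                printf "+ %s %d\n", order[i], count[order[i]]
        }
    ' "$1" "$2"
}
```

It could be called as count_terms_awk input_list.txt data.txt > output.txt. Because the comparison is on the whole 4th field, a term such as 7SK does not match a longer value like 7SKX. The NR == FNR test assumes the list file is not empty.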
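A Perl one-liner works too; note that it counts every value in the 4th column of data.txt without consulting the input list:
perl -lane '$s{ $F[3] }++; END{ print "+ $_ $s{$_}" for sort keys %s }' data.txt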