Linux tools - how to count and list occurrences of regex in file

Tags: regex, linux

I have a file with a large number of similar strings. I want to count unique occurrences of a regex, and also show what they were, e.g. for the pattern Profile: (\w*) on the file:

Profile: blah
Profile: another
Profile: trees
Profile: blah

I want to find that there are 3 occurrences, and return the results:

blah, another, trees

asked Sep 25 '13 14:09

Stefan

2 Answers

Try this:

egrep "Profile: (\w*)" test.text -o | sed 's/Profile: \(\w*\)/\1/g' | sort | uniq

Output:

another
blah
trees

Description

egrep (equivalent to grep -E) with the -o option prints only the part of each line that matches the pattern.

sed keeps only the captured group, i.e. the name that follows "Profile: ".

sort followed by uniq removes duplicates, leaving one line per unique name.

To count the names in the resulting list, append wc -l to the pipeline:

egrep "Profile: (\w*)" test.text -o | sed 's/Profile: \(\w*\)/\1/g' | sort | uniq | wc -l

Output:

3
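If \w is not available (it is a GNU extension in grep and sed), the same idea can be sketched with sed alone. This is only a rough variant, assuming at most one "Profile: name" per line; the function name unique_profiles and the sample variable are made up for illustration, and [A-Za-z0-9_] stands in for \w.

```shell
# Hypothetical helper: read text on stdin, print each unique profile name once.
unique_profiles() {
    # -n plus the p flag prints only the lines where the substitution matched;
    # [A-Za-z0-9_] is used here as a portable stand-in for \w.
    sed -n 's/.*Profile: \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Sample input copied from the question.
sample='Profile: blah
Profile: another
Profile: trees
Profile: blah'

# List the unique names, one per line, sorted.
printf '%s\n' "$sample" | unique_profiles

# Count them.
printf '%s\n' "$sample" | unique_profiles | wc -l
```

sort -u does the job of sort | uniq in one step. For a real file, redirect it into the function instead: unique_profiles < test.text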

answered Nov 05 '22 15:11

jkshah

awk '{a[$2]}END{for(x in a)print x}' file

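This works on your example, because it collects the second whitespace-separated field of every line (the order in which awk walks the array is not guaranteed):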

kent$  echo "Profile: blah
Profile: another
Profile: trees
Profile: blah"|awk '{a[$2]}END{for(x in a)print x}'
another
trees
blah

If you also want the count (3) in the output:

awk '{a[$2]}END{print "count:",length(a);for(x in a)print x }' file

With the same example:

kent$  echo "Profile: blah
Profile: another
Profile: trees
Profile: blah"|awk '{a[$2]}END{print "count:",length(a);for(x in a)print x }'
count: 3
another
trees
blah
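The one-liners above take $2 from every line, so they assume each line looks like "Profile: name". If the file also has other lines, a rough sketch like the following applies the pattern itself and also reports how often each name occurred. The function name count_profiles and the array name seen are made up for illustration, and [A-Za-z0-9_] stands in for \w.

```shell
# Hypothetical helper: read text on stdin, print "name occurrences" for every
# unique profile name, followed by a final "count: N" line.
count_profiles() {
    awk '
        # match() sets RSTART and RLENGTH when the pattern is found in the line
        match($0, /Profile: [A-Za-z0-9_]*/) {
            # "Profile: " is 9 characters long; skip it to keep only the name
            name = substr($0, RSTART + 9, RLENGTH - 9)
            seen[name]++
        }
        END {
            # iteration order of "for (x in array)" is unspecified in awk
            for (name in seen) { print name, seen[name]; n++ }
            print "count:", n + 0
        }
    '
}

# Sample input copied from the question.
sample='Profile: blah
Profile: another
Profile: trees
Profile: blah'

printf '%s\n' "$sample" | count_profiles
```

Counting inside the loop avoids length(a) on an array, which some older awk implementations do not support.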

answered Nov 05 '22 14:11

Kent
