I want to count the lines in a document, grouped by their prefix word. The prefix is the run of alphanumeric characters before the first underscore. I don't care much about sorting, but it would be nice to list the prefixes in descending order of occurrences.
The file looks like this:
prefix1_data1
prefix1_data2_a
differentPrefix_data3
prefix1_data2_b
differentPrefix_data5
prefix2_data4
differentPrefix_data5
The output should be the following:
prefix1 3
differentPrefix 3
prefix2 1
I already did this in Python, but I am curious whether it can be done more efficiently from the command line or with a bash script. The uniq command has the -c and -w options, but the length of the prefix may vary.
A solution using a combination of the sed, sort and uniq commands:
sed -rn 's/^([^_]+)_.*/\1/p' testfile | sort | uniq -c
The output:
3 differentPrefix
3 prefix1
1 prefix2
^([^_]+)_ - matches a substring (the prefix, containing any characters except _) from the start of the string up to the first underscore _.
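The question also asks for descending order by count with the prefix in the first column. A minimal sketch of one way to get there, assuming the prefixes contain no whitespace (the function name prefix_counts is hypothetical, chosen for illustration): a second sort orders the counts numerically in reverse, and awk swaps the two columns.

```shell
# Hypothetical helper: reads the named files, or stdin when none are given.
prefix_counts() {
    # 1. sed keeps only the prefix of lines that contain an underscore
    # 2. sort | uniq -c counts identical prefixes
    # 3. sort -rn orders by count, highest first
    # 4. awk swaps the columns to "prefix count"
    sed -rn 's/^([^_]+)_.*/\1/p' "$@" | sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Usage: prefix_counts testfile
```

Lines without an underscore are dropped by the sed step, since -n together with the p flag prints only lines where the substitution succeeded.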
You could use awk:
awk -F_ '{a[$1]++}END{for(i in a) print i,a[i]}' file
The field separator is set to _.
An array a is filled with each first field as the key and its number of occurrences as the value.
Once the whole file has been parsed, the array contents are printed.
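awk does not guarantee any particular order when iterating with for (i in a), so the result may come out in arbitrary order. If the descending order from the question matters, one option is to pipe the result through sort. This is a sketch only, and the function name count_by_prefix is hypothetical:

```shell
# Hypothetical helper: reads the named files, or stdin when none are given.
count_by_prefix() {
    # Count each first field, then sort numerically (descending) on column 2.
    awk -F_ '{ count[$1]++ } END { for (p in count) print p, count[p] }' "$@" |
        sort -k2,2nr
}

# Usage: count_by_prefix file
```

Unlike the sed approach, this also counts lines that have no underscore, because in that case the whole line becomes the first field.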