Count lines and group by prefix word

Question

I want to count the number of lines in a document and group the counts by prefix word. The prefix is the run of alphanumeric characters before the first underscore. I don't care much about sorting, but it would be nice to list the prefixes in descending order of occurrences.

The file looks like this:

prefix1_data1
prefix1_data2_a
differentPrefix_data3
prefix1_data2_b
differentPrefix_data5
prefix2_data4
differentPrefix_data5

The output should be the following:

prefix1           3
differentPrefix   3
prefix2           1

I already did this in Python, but I am curious whether it can be done more efficiently on the command line or in a bash script. The uniq command has -c and -w options, but the length of the prefix may vary.

RomanPerekhrest · Accepted Answer

A solution using a combination of the sed, sort and uniq commands:

sed -rn 's/^([^_]+)_.*/\1/p' testfile | sort | uniq -c

The output:

3 differentPrefix
3 prefix1
1 prefix2

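^([^_]+)_ - matches a substring (the prefix, containing any characters except _) from the start of the line up to the first underscore _. With -n and the p flag, only lines where the substitution succeeded are printed, so lines without an underscore are skipped.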
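
The pipeline above orders the result by prefix name, not by count. A minimal sketch of one way to get the descending order the question asks for, by adding a numeric sort at the end. The function name count_prefixes is hypothetical, -E is used here as the portable spelling of GNU sed's -r, and the second sort key is an added assumption to make ties come out in a stable order:

```shell
# count_prefixes FILE
# Hypothetical helper: prints "count prefix" pairs, highest count first.
count_prefixes() {
    # 1. keep only the text before the first underscore
    # 2. sort so that identical prefixes are adjacent
    # 3. collapse duplicates and prepend the count
    # 4. sort by count (numeric, descending), then by prefix name for ties
    sed -En 's/^([^_]+)_.*/\1/p' "$1" | sort | uniq -c | sort -k1,1nr -k2,2
}
```

It would be called as count_prefixes testfile. uniq -c pads the count with leading spaces, which the numeric sort ignores.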

oliv · Answer

You could use awk:

awk -F_ '{a[$1]++}END{for(i in a) print i,a[i]}' file

The field separator is set to _.

An array a is filled with the first field of each line as the key and the number of times it occurs as the value.

Once the whole file has been read, the array contents are printed.
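
The order of for (i in a) in awk is unspecified, so the counts come out in arbitrary order. A sketch of how the same idea could be extended to match the layout in the question: printf aligns the columns and a trailing sort orders by count. The function name count_prefixes_awk and the column width of 17 are assumptions chosen for this example, not part of the original answer:

```shell
# count_prefixes_awk FILE
# Hypothetical helper: prints "prefix  count" in aligned columns,
# highest count first, ties ordered by prefix name.
count_prefixes_awk() {
    awk -F_ '
        { count[$1]++ }                      # tally the first _-separated field
        END {
            for (p in count)
                printf "%-17s %d\n", p, count[p]   # left-align prefix in 17 columns
        }
    ' "$1" | sort -k2,2nr -k1,1
}
```

Unlike the sed version, this also counts lines that contain no underscore at all, using the whole line as the key.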

Tags:

linux

bash

unix

command-line

Wojciech K
