Parsing a .csv-like file in bash

Question

I have a file formatted as follows:

string1,string2,string3,...
...

I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:

"number of occurrences of x",x
"number of occurrences of y",y        
...

I managed to write the following script, which works fine:

#!/bin/bash

> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
    if [[ "$line" =~ $regExp ]]
    then
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"

My question is: is there a better and simpler way to do the job?

In particular, I don't know how to fix this:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'

The problem is that string2 can contain whitespace and, if it does, the second call to gawk truncates the string. I also don't know how to print all the fields from 2 to NF while preserving the delimiter, which can occur several times in succession.

Thank you very much.

EDIT:

As requested, here is some sample data:

(It is an exercise; sorry for the lack of inventiveness.)

Input:

*,*,*
test,  test  ,test
prova, * , prova
test,test,test
prova,  prova   ,prova
leonardo,da vinci,leonardo
in,o    u   t   ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o    u   t   ,pr
test,  test  ,test
,   tabs    ,
,   tabs    ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
,   tabs    ,

Output:

3, * 
4,*
4,da vinci
2,o u   t   
3,po
1,  prova   
3, spaces 
3,  tabs    
1,test
2,  test

Filipe Gonçalves · Accepted Answer

A one-liner in awk:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv

It stores the count for each 2nd-column string in the associative array x, keyed by the string itself, and at the end loops through the array and prints each count followed by its key.

To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2

The only condition, of course, is that the 2nd column of each line doesn't contain a comma.
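Note that the asker's pipeline skips empty lines with !/^$/, while the one-liner above would count them under an empty key. A minimal sketch of a variant that skips them, assuming the same comma-free 2nd column (the function name count_second_column and the array name count are hypothetical, chosen only for illustration):

```shell
#!/bin/bash
# Hypothetical helper: count occurrences of the 2nd comma-separated field.
# Reads the files given as arguments, or standard input if there are none.
count_second_column() {
    # NF is 0 on empty lines, so they are skipped; count[$2]++ tallies
    # each distinct 2nd field, with its whitespace left untouched.
    awk -F, 'NF { count[$2]++ } END { for (key in count) print count[key] "," key }' "$@" |
        sort -t, -k2,2
}

# Small demonstration: the empty line in the middle is not counted.
printf '%s\n' 'a,da vinci,x' '' 'b, * ,y' 'c,da vinci,z' | count_second_column
```

The order of the sorted output depends on the current locale, so it may differ slightly between systems.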

meuh · Answer

You can change your final awk call so that it strips the leading padding and replaces only the first space with a comma:

gawk '{ sub(/^ */, ""); sub(/ /, ","); print }'

or use sed for this sort of thing:

sed 's/ *\([0-9]*\) /\1,/'
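Put together with the asker's original pipeline, the sed approach might look like the sketch below. The function name count_with_uniq is hypothetical, and the pattern is anchored with ^ and requires at least one digit, which is a slightly stricter variant of the expression above rather than the exact one from this answer.

```shell
#!/bin/bash
# Hypothetical wrapper: the asker's sort | uniq -c pipeline plus a sed fix-up.
# uniq -c prints "<padding><count> <line>"; the sed expression drops the
# padding and turns only the first separator space into a comma, so any
# whitespace inside the 2nd column survives unchanged.
count_with_uniq() {
    awk -F, '!/^$/ { print $2 }' "$@" |
        sort |
        uniq -c |
        sed 's/^ *\([0-9][0-9]*\) /\1,/'
}

# Small demonstration with a field that contains runs of spaces.
printf '%s\n' 'x,o    u   t   ,pr' 'y,po,po' 'z,po,po' | count_with_uniq
```

Because sed edits the line as text instead of re-splitting it into fields, repeated spaces are not collapsed, which was the problem with the second gawk call in the question.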

Tags: regex, bash, csv, awk, gawk

Luca
