Parsing a .csv-like file in bash

I have a file formatted as follows:

string1,string2,string3,...
...

I have to analyze the second column, count the occurrences of each string, and produce a file formatted as follows:

"number of occurrences of x",x
"number of occurrences of y",y        
...

I managed to write the following script, which works fine:

#!/bin/bash

> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
    if [[ "$line" =~ $regExp ]]
    then
        # Pass the data as arguments so a % or \ in it is not read as a format directive
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"

My question is: is there a better and simpler way to do the job?

In particular, I don't know how to fix this:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'

The problem is that string2 can contain whitespace, in which case the second gawk call truncates the string. I also don't know how to print all the fields "from 2 to NF" while preserving the delimiter, which can occur several times in succession.

Thank you very much. Goodbye.

EDIT:

As requested, here is some sample data:

(It is an exercise, so sorry for the lack of imagination.)

Input:

*,*,*
test,  test  ,test
prova, * , prova
test,test,test
prova,  prova   ,prova
leonardo,da vinci,leonardo
in,o    u   t   ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o    u   t   ,pr
test,  test  ,test
,   tabs    ,
,   tabs    ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
,   tabs    ,

Output:

3, * 
4,*
4,da vinci
2,o u   t   
3,po
1,  prova   
3, spaces 
3,  tabs    
1,test
2,  test  

Luca asked Sep 08 '15 18:09


2 Answers

A one-liner in awk:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv

It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.

awk visits the keys in for (i in x) in an unspecified order, so to get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:

awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2

The only condition, of course, is that neither the 1st nor the 2nd column of any line contains an embedded , since awk splits on every comma and knows nothing about CSV quoting.
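
For reference, here is a minimal sketch of the same idea wrapped in a shell function, with sample input fed on stdin. The function name count_col2 and the array name count are illustrative and not from the answer above. The !/^$/ guard is borrowed from the question's script to skip blank lines, and LC_ALL=C is an assumption added only to make the sort order reproducible across locales:

```shell
# count_col2: hypothetical helper name, not part of the original answer.
# Counts how often each value of the 2nd comma-separated field occurs.
count_col2() {
    awk -F, '
        !/^$/ { count[$2]++ }             # skip empty lines, tally field 2 verbatim
        END   { for (key in count) print count[key] "," key }
    ' "$@" | LC_ALL=C sort -t, -k2,2      # for-in order is unspecified, so sort by field 2
}

# Usage: whitespace inside the field is preserved as-is.
printf 'a,da vinci,c\n\nb,da vinci,d\nc,po,e\n' | count_col2
```

With that input the function should print 2,da vinci followed by 1,po. Because awk never re-splits the counted string, embedded spaces and tabs in the second column are carried through unchanged.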

Filipe Gonçalves answered Sep 20 '22 04:09


You can replace your final awk call with:

gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'

or use sed for this sort of thing:

sed 's/ *\([0-9]*\) /\1,/'
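
As a quick check of the sed approach, the sketch below feeds it text shaped like uniq -c output (a right-aligned count, one space, then the line). The function name uniq_c_to_csv is made up for illustration, and the pattern is a slightly stricter variant of the one above: it is anchored with ^ and requires at least one digit, which is an assumption about what uniq -c prints rather than something the answer states:

```shell
# uniq_c_to_csv: hypothetical helper name used only for this illustration.
# Turns "   <count> <text>" into "<count>,<text>", touching only the
# leading padding and the single space after the count.
uniq_c_to_csv() {
    sed 's/^ *\([0-9][0-9]*\) /\1,/'
}

# Usage: spaces that belong to the field itself survive untouched.
printf '      2   test  \n      4 da vinci\n' | uniq_c_to_csv
```

This should print 2,  test   (with the field's own leading and trailing spaces intact) and 4,da vinci, which is the part the second gawk call in the question was losing.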

meuh answered Sep 20 '22 04:09