I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).
The content of the CSV file looks similar to this:
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
I have some code that can separate the fields using the comma as a delimiter, but some values actually contain commas, such as "Adygeya, Republic". These values are surrounded by double quotes to indicate that the characters within should be treated as part of the field, but I don't know how to parse the file so that this is taken into account.
Currently I have this loop:
while IFS=, read province provinceCode criteriaId countryCode country
do
echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input
which produces this output for the sample data given above:
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]
As you can see, the third entry is parsed incorrectly. I want it to output:
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
If you want to do it all in awk (GNU awk 4 is required for this script to work as intended):
awk '{
  for (i = 0; ++i <= NF;) {
    # strip the surrounding quotes from quoted fields
    if (substr($i, 1, 1) == "\"")
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile
Sample output:
% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{
  for (i = 0; ++i <= NF;) {
    if (substr($i, 1, 1) == "\"")
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
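If you also want the embedded comma removed, to match the exact output you asked for, a small variant of the same script does it (a sketch; like the original, it needs GNU awk 4 for FPAT):

awk '{
  for (i = 0; ++i <= NF;) {
    if (substr($i, 1, 1) == "\"") {
      $i = substr($i, 2, length($i) - 2)
      gsub(/,/, "", $i)   # also drop the comma that was inside the quotes
    }
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile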
With Perl:
perl -MText::ParseWords -lne'
  print join " ", map "[$_]",
    parse_line(",", 0, $_);
' infile
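Run on the same infile, this should print the fields with the quotes already stripped, since the second argument to parse_line (0) tells Text::ParseWords not to keep them:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]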
This should work with your awk version (it is based on a comp.unix.shell post, and it also removes the embedded commas, matching your desired output).
awk '{
  n = parse_csv($0, data)
  for (i = 0; ++i <= n;) {
    gsub(/,/, "", data[i])   # drop embedded commas, per the desired output
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
  }
}

# split the CSV record str into array; field and i are local variables
function parse_csv(str, array,   field, i) {
  split("", array)   # empty the array
  str = str ","      # append a comma so the last field is matched too
  # each iteration consumes one (optionally quoted) field plus its comma
  while (match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/)) {
    field = substr(str, 1, RLENGTH)
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)   # strip quotes and the comma
    gsub(/""/, "\"", field)                   # unescape doubled quotes
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile
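As a side note, the doubled-quote handling means escaped quotes survive the round trip: given the input line "He said ""hi""",X the loop above should print [He said "hi"] [X].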
After looking at @Dimitre's solution above, you can do something like this:
#!/usr/local/bin/gawk -f

BEGIN {
    # FPAT describes the fields themselves rather than the separators:
    # a field is either a run of non-commas or a double-quoted string.
    # Once FPAT is assigned, gawk ignores FS.
    FS = ","
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    for (i = 1; i <= NF; i++)
        printf("[%s] ", $i)
    print ""
}
[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
[jaypal:~/Temp] ./script.awk filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation]
For removing the surrounding " characters, you can pipe the output to sed:
[jaypal:~/Temp] ./script.awk filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
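Alternatively, you can strip the quotes inside awk itself instead of piping to sed, reusing the substr() trick from @Dimitre's answer (a sketch along the same lines as script.awk):

#!/usr/local/bin/gawk -f

BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    for (i = 1; i <= NF; i++) {
        # remove the surrounding quotes from quoted fields
        if (substr($i, 1, 1) == "\"")
            $i = substr($i, 2, length($i) - 2)
        printf("[%s] ", $i)
    }
    print ""
}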
After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.
To that end, I've concocted a sed command that matches strings surrounded by double quotes that contain a comma, and removes the bits you don't want from the matched string. It does this by separating the regex into remembered (captured) sections.
This solution only works where the string contains a single comma between double quotes.
The unescaped regex is
(")(.*)(,)(.*)(")
The first, third, and fifth pairs of parentheses capture the opening double quote, comma, and closing double quote respectively.
The second and fourth pairs of parentheses capture the actual content of the field, which we want to keep.
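For the third sample row, that means \1 matches the opening ", \2 captures Adygeya, \3 matches the comma, \4 captures Republic together with its leading space, and \5 matches the closing ".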
sed command to remove the comma (keeping the quotes):
echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\1\2\4\5/'
sed command to remove the comma and the double quotes:
echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/'
Updated Code:
tmpFile="${input}Temp"
# strip the comma and quotes from quoted fields before parsing
sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/' < "$input" > "$tmpFile"
while IFS=, read -r province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$tmpFile"
rm "$tmpFile"
Output:
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
[Bío-Bío] [CL-BI] [20154] [CL] [Chile]
If you can tolerate having the surrounding quotes persist in the output, you can use a small script I wrote called csvquote to enable awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:
csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u
See https://github.com/dbro/csvquote for the code and documentation.
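csvquote works by temporarily replacing the ambiguous commas (and newlines) inside quoted fields with nonprinting characters, so the ordinary text tools in the middle of the pipeline only see the real delimiters; csvquote -u then restores the original characters. If you also want the quotes gone, you can append the sed trick shown above:

csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u | sed 's#\"##g'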