I want to replace every comma not enclosed in a pair of double quotes with '|' using awk pattern matching within sed.
For example, given the following input:
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
This is the desired output:
A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
I want this accomplished without using hard coding like:
sed '2s/33,MIS/33|MIS|/' file.
First sample: Quick and dirty:
If your comma are ALWAY followed by a space in text strings, and NEVER in field separation, you could use:
sed -e 's/,\([^ ]\)/\|\1/g'
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
but you have to be sure about next character.
More elaborated sample, whithout need of space, closest to your original idea.
sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'
echo '"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M' |
sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
echo '"Chang,Yao-Jen",33,MIS,"Taiwan,Taipei",M' |
sed -e '1 { :a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta }'
"Chang,Yao-Jen"|33|MIS|"Taiwan,Taipei"|M
Explained:
sed -e '
:a
s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/
ta
'
:a
is a address location for branch (loop)s/
search from begin of line for '[^",]*,' or '"...",' than replace comma by vbar.ta
branch to a if previous s/
have been matched.As you requested to operate on line 2, you will have to:
sed -e '2 { :a; s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/; ta } '
Edited: [WRONG! See Edit 3]
Another sample if you wanna have mixed quotes and double-quotes:
There is a sample with mixed quoted, non-quoted, and one field containing a quote, but double quoted:
cat <<eof >sample
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
A,B,'C,D',E,'F,G',H,'I,J,K'
'Chang, Yao-Jen',33,MIS,'Taiwan, Taipei',M
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M,'Chang,Yao-Jen',34,MZZ,'Taiwan, Taipei',Z
"Chang's son: Yao-Lu",55,MAA,'Taiwan, too',z
eof
sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;ta' sample
A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
A|B|'C,D'|E|'F,G'|H|'I,J,K'
'Chang, Yao-Jen'|33|MIS|'Taiwan, Taipei'|M
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M|'Chang,Yao-Jen'|34|MZZ|'Taiwan, Taipei'|Z
"Chang's son: Yao-Lu"|55|MAA|'Taiwan, too'|z
where sed script could by confined in a little more readable script file as:
cat <<oesedscript >csvtopsv.sed
#!/bin/sed -f
# Coma Separated Values to Pipe Separated Values
:a
s/^\(\(\(['"]\)[^\3]*\3\|[^",']*\)*\),/\1|/;
ta
oesedscript
chmod +x csvtopsv.sed
./csvtopsv.sed sample
A|B|"C,D"|E|"F|G"|H|"I|J|K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
A|B|'C,D'|E|'F|G'|H|'I|J|K'
'Chang, Yao-Jen'|33|MIS|'Taiwan, Taipei'|M
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M|'Chang,Yao-Jen'|34|MZZ|'Taiwan, Taipei'|Z
"Chang's son: Yao-Lu"|55|MAA|'Taiwan, too'|z
Explained:
the s/
search for quote or double quote ['"]
as third enclosed regex part, followed by 0 or more other character than mathing third enclosed part, finaly followed by a second character same as third regex part... OR no coma, single nor double quote [,'"]
...
Edit 3 Warn! this was Wrong!:
So right answer seem definitively be something like:
sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;ta'
you could see my error in adding a ;L
for debugging, before ta
:
sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;L;ta'
where
echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1#/;L;ta'
1#"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh",236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
we can see that this is not as simple... [^\3]
dont give intended effect, but instead match for not char 3
.
Finaly, we have to search for each delimitor for himself:
:a;
s/^\(\("[^"]*"\|'[^']*'\|[^",']*\)*\),/\1\t/;
ta
Nota: from there, I will present csv2tsv
as coma to tab separated values, if you really prefer to use |
pipe as separator, you could replace \t
by |
or any char you want.
well commmand line is less sexy:
echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1\t/;L;ta'
1 "John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
1 "John Doe" "6, rue Peuh",236,"B,-,F,H,P,-",-55
1 "John Doe" "6, rue Peuh" 236,"B,-,F,H,P,-",-55
1 "John Doe" "6, rue Peuh" 236 "B,-,F,H,P,-",-55
1 "John Doe" "6, rue Peuh" 236 "B,-,F,H,P,-" -55
1 "John Doe" "6, rue Peuh" 236 "B,-,F,H,P,-" -55
1 "John Doe" "6, rue Peuh" 236 "B,-,F,H,P,-" -55
But this match the need.
echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1\t/;ta'
1 "John Doe" "6, rue Peuh" 236 "B,-,F,H,P,-" -55
In fine, creating sedscript:
cat >csv2tsv.sed <<eof
#!/bin/sed -f
# Coma separated values to Tab separated values
:a
s/^\(\("[^"]*"\|'[^']*'\|[^",']*\)*\),/\1\t/;
ta
eof
chmod +x csv2tsv.sed
Now:
cat >file.csv <<eof
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
4,"hacker's string",'one quote: "I have no special talents. I am only passionat\
ely curious." - Albert Einstein',unquoted string,9,1,1,3
eof
./csv2tsv.sed file.csv
A B "C,D" E "F,G" H "I,J,K"
"Chang, Yao-Jen" 33 MIS "Taiwan, Taipei" M
1 "John Doe" "6, rue Peuh" 236 "B,-,F,H,P,-" -55
4 "hacker's string" 'one quote: "I have no special talents. I am only pa
ssionately curious." - Albert Einstein' unquoted string 9 1 1 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With