Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace every comma not enclosed in a pair of double quotes with '|' [closed]

Tags:

sed

awk

I want to replace every comma not enclosed in a pair of double quotes with '|' using awk pattern matching within sed.

For example, given the following input:

A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M

This is the desired output:

A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

I want this accomplished without using hard coding like:

sed '2s/33,MIS/33|MIS|/' file.
like image 275
loganaayahee Avatar asked Nov 20 '12 14:11

loganaayahee


1 Answers

First sample: Quick and dirty:

If your comma are ALWAY followed by a space in text strings, and NEVER in field separation, you could use:

sed -e 's/,\([^ ]\)/\|\1/g'
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

but you have to be sure about next character.

More elaborated sample, whithout need of space, closest to your original idea.

sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'

echo '"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M' |
  sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

echo '"Chang,Yao-Jen",33,MIS,"Taiwan,Taipei",M' |
  sed -e '1 { :a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta }'
"Chang,Yao-Jen"|33|MIS|"Taiwan,Taipei"|M

Explained:

sed -e '
    :a
    s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/
    ta
'
  • :a is a address location for branch (loop)
  • s/ search from begin of line for '[^",]*,' or '"...",' than replace comma by vbar.
  • ta branch to a if previous s/ have been matched.

As you requested to operate on line 2, you will have to:

sed -e '2 { :a; s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/; ta } '

Edited: [WRONG! See Edit 3]

Another sample if you wanna have mixed quotes and double-quotes:

There is a sample with mixed quoted, non-quoted, and one field containing a quote, but double quoted:

cat <<eof >sample
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
A,B,'C,D',E,'F,G',H,'I,J,K'
'Chang, Yao-Jen',33,MIS,'Taiwan, Taipei',M
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M,'Chang,Yao-Jen',34,MZZ,'Taiwan, Taipei',Z
"Chang's son: Yao-Lu",55,MAA,'Taiwan, too',z
eof

sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;ta' sample
A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
A|B|'C,D'|E|'F,G'|H|'I,J,K'
'Chang, Yao-Jen'|33|MIS|'Taiwan, Taipei'|M
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M|'Chang,Yao-Jen'|34|MZZ|'Taiwan, Taipei'|Z
"Chang's son: Yao-Lu"|55|MAA|'Taiwan, too'|z

where sed script could by confined in a little more readable script file as:

cat <<oesedscript >csvtopsv.sed 
#!/bin/sed -f 
# Coma Separated Values to Pipe Separated Values
:a
s/^\(\(\(['"]\)[^\3]*\3\|[^",']*\)*\),/\1|/;
ta
oesedscript
chmod +x csvtopsv.sed

./csvtopsv.sed sample
A|B|"C,D"|E|"F|G"|H|"I|J|K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
A|B|'C,D'|E|'F|G'|H|'I|J|K'
'Chang, Yao-Jen'|33|MIS|'Taiwan, Taipei'|M
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M|'Chang,Yao-Jen'|34|MZZ|'Taiwan, Taipei'|Z
"Chang's son: Yao-Lu"|55|MAA|'Taiwan, too'|z

Explained:

the s/ search for quote or double quote ['"] as third enclosed regex part, followed by 0 or more other character than mathing third enclosed part, finaly followed by a second character same as third regex part... OR no coma, single nor double quote [,'"]...

Edit 3 Warn! this was Wrong!:

So right answer seem definitively be something like:

sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;ta'

you could see my error in adding a ;L for debugging, before ta:

sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;L;ta'

where

echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
  sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1#/;L;ta'
1#"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh",236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55

we can see that this is not as simple... [^\3] dont give intended effect, but instead match for not char 3.

Finaly, we have to search for each delimitor for himself:

:a;
s/^\(\("[^"]*"\|'[^']*'\|[^",']*\)*\),/\1\t/;
ta

Nota: from there, I will present csv2tsv as coma to tab separated values, if you really prefer to use | pipe as separator, you could replace \t by | or any char you want.

well commmand line is less sexy:

echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
  sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1\t/;L;ta' 
1       "John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh",236,"B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh"   236,"B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55
1   "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55

But this match the need.

echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
  sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1\t/;ta' 
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55

In fine, creating sedscript:

cat >csv2tsv.sed <<eof
#!/bin/sed -f
# Coma separated values to Tab separated values

:a
s/^\(\("[^"]*"\|'[^']*'\|[^",']*\)*\),/\1\t/;
ta
eof

chmod +x csv2tsv.sed

Now:

cat >file.csv <<eof
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
4,"hacker's string",'one quote: "I have no special talents. I am only passionat\
ely curious." - Albert Einstein',unquoted string,9,1,1,3
eof

./csv2tsv.sed file.csv 
A   B       "C,D"   E      "F,G"    H    "I,J,K"
"Chang, Yao-Jen"    33     MIS      "Taiwan, Taipei"        M
1   "John Doe"      "6, rue Peuh"   236  "B,-,F,H,P,-"      -55
4   "hacker's string"      'one quote: "I have no special talents. I am only pa
ssionately curious." - Albert Einstein' unquoted string 9  1    1       3
like image 142
F. Hauri Avatar answered Oct 02 '22 23:10

F. Hauri