I have this test file.
[root@localhost ~]# cat f.txt
"a aa" MM "bbb  b"
MM MM
MM"b b "
[root@localhost ~]#
I want to replace every space character inside the quotes with an underscore. Note: only inside the quotes; characters outside the quotes should not be touched. That is, I want something like:
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
Can this be implemented using sed?
Thanks,
This is an entirely non-trivial question.
This works, replacing the first space inside each pair of quotes with an underscore:
$ sed 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa" MM "bbb_ b"
MM MM
MM"b_b "
$
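For this example, where there are no more than two spaces inside any pair of quotes, it is tempting to simply repeat the command, but that gives an incorrect result. The regex cannot tell an opening quote from a closing one, so the second pass treats the closing quote of "a_aa" as the start of a quoted string and replaces the space after it: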
$ sed -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' \
> -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa"_ MM "bbb_ b"
MM MM
MM"b_b_"
$
If your version of sed supports 'extended regular expressions', then this works for the sample data:
$ sed -E \
> -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
> -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
> -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
> f.txt
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
$
You have to repeat that ghastly regex for every space within double quotes - hence three times for the first line of data.
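The regex can be explained as:
^ anchors the match at the start of the line.
(([^"]*("[^ "]*")?)*) captures, as \1, everything before the quoted string being fixed: any number of runs of non-quote text, each optionally followed by a quoted string that contains no spaces.
("[^ "]*) captures, as \4, an opening quote and the characters after it up to the first space inside the quotes.
The single space that follows is the character being replaced.
([^"]*") captures, as \5, the rest of that quoted string up to and including its closing quote.
The replacement \1\4_\5 puts everything back with an underscore in place of the space.
Because of the start anchor, this has to be repeated once per blank... but sed has a looping construct, so we can do it with: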
$ sed -E -e ':redo
> s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
> t redo' f.txt
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
$
The :redo defines a label; the s/// command is as before; the t redo command jumps to the label if any substitution has been made since the last input line was read or the last t jump was taken.
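For use inside a script, the same loop can be written without embedded newlines by giving each sed command its own -e option. The following is only a minimal sketch, assuming a sed that accepts -E; the function name underscore_quoted is made up for this example and is not part of the original answer.

```shell
# Hypothetical helper (the name is an assumption for illustration).
# Replaces every space inside double quotes with an underscore.
# Assumes a sed that accepts -E for extended regular expressions.
underscore_quoted() {
    sed -E \
        -e ':redo' \
        -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
        -e 't redo' "$@"
}

# Usage: reads the named files, or standard input if none are given.
printf '%s\n' '"a aa" MM "bbb  b"' 'MM"b b "' | underscore_quoted
```

Called as underscore_quoted f.txt, it should behave like the multi-line script above, since the label, the substitution and the branch are the same three commands.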
Given the discussion in the comments, there are a couple of points worth mentioning:
The -E option applies to sed on Mac OS X (tested 10.7.2). The corresponding option for the GNU version of sed is -r (or --regexp-extended). The -E option is consistent with grep -E (which also uses extended regular expressions). The 'classic Unix systems' (Solaris 10, AIX 6, HP-UX 11) do not support EREs with sed.
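Because of the -E versus -r difference, a script that has to run on more than one system could probe for whichever flag the local sed accepts. This is a rough sketch under that assumption only; the variable name ere_flag is hypothetical, and the probe simply runs a trivial ERE substitution and checks the exit status.

```shell
# Sketch: pick whichever ERE flag the local sed accepts.
# ere_flag is a hypothetical variable name used only for this example.
if printf 'x\n' | sed -E 's/(x)?/\1/' >/dev/null 2>&1; then
    ere_flag=-E
elif printf 'x\n' | sed -r 's/(x)?/\1/' >/dev/null 2>&1; then
    ere_flag=-r
else
    ere_flag=
fi

if [ -n "$ere_flag" ]; then
    # Same label / substitute / branch loop as before, one -e per command.
    printf '%s\n' 'MM"b b "' |
        sed "$ere_flag" -e ':redo' \
            -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
            -e 't redo'
else
    echo 'this sed has no ERE option; use a BRE form of the script instead' >&2
fi
```

On a sed with neither flag, the sketch only prints a message; the BRE form of the script is then the one to use.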
You can replace the ? I used (which is the only character that forces the use of an ERE instead of a BRE) with *, and then deal with the parentheses (which need backslashes in front of them in a BRE to make them capturing parentheses), leaving the script:
sed -e ':redo
s/^\(\([^"]*\("[^ "]*"\)*\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
t redo' f.txt
This produces the same output on the same input - I tried some slightly more complex patterns in the input:
"a aa" MM "bbb  b"
MM MM
MM"b b "
"c c""d d""e  e" X " f "" g "
"C C" "D D" "E  E" x " F " " G "
This gives the output:
"a_aa" MM "bbb__b"
MM MM
MM"b_b_"
"c_c""d_d""e__e" X "_f_""_g_"
"C_C" "D_D" "E__E" x "_F_" "_G_"
Even with BRE notation, sed supports the \{0,1\} notation to specify 0 or 1 occurrences of the previous RE term, so the ? version could be translated to a BRE using:
sed -e ':redo
s/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
t redo' f.txt
This produces the same output as the other alternatives.
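One way to check that claim on your own system is to run both BRE variants over the same input and compare the results. The sketch below does that for one of the more complex test lines; the function names bre_star and bre_interval are invented for this example.

```shell
# Sketch: run both BRE variants on the same input and compare.
# The function names are hypothetical, chosen for this example.
bre_star() {
    sed -e ':redo' \
        -e 's/^\(\([^"]*\("[^ "]*"\)*\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g' \
        -e 't redo' "$@"
}
bre_interval() {
    sed -e ':redo' \
        -e 's/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g' \
        -e 't redo' "$@"
}

sample='"c c""d d""e  e" X " f "" g "'
a=$(printf '%s\n' "$sample" | bre_star)
b=$(printf '%s\n' "$sample" | bre_interval)
if [ "$a" = "$b" ]; then
    printf 'same: %s\n' "$a"
else
    printf 'differ:\n%s\n%s\n' "$a" "$b"
fi
```

Replacing the sample variable with a loop over the lines of f.txt would extend the comparison to the whole file.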