
use sed to replace text just in quotes

Tags: regex, sed

I have this test file.

[root@localhost ~]# cat f.txt 
"a aa"  MM  "bbb  b"
MM    MM
MM"b b "
[root@localhost ~]#

I want to replace every space character inside the quotes, and only inside the quotes. Characters outside the quotes should not be touched. That is, I want output like this:

"a_aa"  MM  "bbb__b"
MM    MM
MM"b_b_"

Can this be implemented using sed?

Thanks,

asked Nov 25 '11 07:11 by Just a learner


1 Answer

This is an entirely non-trivial question.

For the sample data, this replaces the first space inside each quoted string with an underscore:

$ sed 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa"  MM  "bbb_ b"
MM    MM
MM"b_b "
$

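For this example, where there are no more than two spaces inside any of the quotes, it is tempting to simply repeat the command, but that gives an incorrect result. The pattern cannot tell an opening quote from a closing one, so on the first line the second pass matches from the closing quote of "a_aa" to the opening quote of "bbb_ b" and replaces a space outside the quotes: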

$ sed -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' \
>     -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa"_ MM  "bbb_ b"
MM    MM
MM"b_b_"
$
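To see what the second pass actually matches, one option is to keep the same pattern but wrap the match in brackets instead of editing it. This is only a diagnostic sketch; the variable names are made up for illustration.

```shell
# Line 1 of f.txt as it looks after the first pass.
after_first_pass='"a_aa"  MM  "bbb_ b"'

# Same BRE as above, but the replacement brackets the whole match (&)
# rather than changing the space.
marked=$(printf '%s\n' "$after_first_pass" |
    sed 's/\("[^ "]*\) \([^"]*"\)/[&]/g')

printf '%s\n' "$marked"
# prints: "a_aa["  MM  "]bbb_ b"
```

The bracketed span starts at the closing quote of the first string and ends at the opening quote of the second, which is why the space after "a_aa" becomes an underscore.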

If your version of sed supports 'extended regular expressions', then this works for the sample data:

$ sed -E \
>    -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
>    -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
>    -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
>    f.txt
"a_aa"  MM  "bbb__b"
MM    MM
MM"b_b_"
$

You have to repeat that ghastly regex once for every space within double quotes, hence three times for the three quoted spaces on the first line of data.

The regex can be explained as:

  • Starting at the beginning of a line,
  • Look for sequences of 'zero or more non-quotes, optionally followed by a quote, zero or more characters that are neither spaces nor quotes, and a quote', the whole assembly repeated zero or more times,
  • Followed by a quote, zero or more characters that are neither quotes nor spaces, a space, zero or more non-quotes, and a quote.
  • Replace the matched material with the leading part, the material at the start of the current quoted passage, an underscore, and the trailing material of the current quoted passage.
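The breakdown above can be sketched by building the same ERE from named pieces. The variable names (done_part, head, tail) are hypothetical and exist only to label the parts; concatenated, they give the regex used in the command above.

```shell
# Group 1: everything before the current quoted string, including any
# complete quoted strings that contain no spaces.
done_part='^(([^"]*("[^ "]*")?)*)'
# Group 4: opening quote plus the text up to the first space inside it.
head='("[^ "]*)'
# Group 5: the rest of the quoted string and its closing quote.
tail='([^"]*")'

# One substitution only: the first space inside quotes becomes "_".
result=$(printf '%s\n' '"a aa"  MM  "bbb  b"' |
    sed -E "s/${done_part}${head} ${tail}/\1\4_\5/")

printf '%s\n' "$result"
# prints: "a_aa"  MM  "bbb  b"
```

Groups 2 and 3 are only there for grouping and repetition, which is why the replacement skips from \1 to \4.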

Because of the start anchor, each substitution handles only one blank, so it has to be repeated once per blank. But sed has a looping construct, so we can do it with:

$ sed -E -e ':redo
>            s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
>            t redo' f.txt
"a_aa"  MM  "bbb__b"
MM    MM
MM"b_b_"
$

The :redo defines a label; the s/// command is as before; the t redo command jumps to the label if any substitution has been made since the last line was read or since the last t command was executed.
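A rough way to watch the loop work is to print the pattern space at the top of every iteration. In this sketch, -n and the extra p command are added purely for tracing and are not part of the solution; the script is split across several -e options, which should behave the same as the multi-line form.

```shell
# Trace the loop on line 1 of f.txt: p prints the pattern space each
# time control reaches :redo; -n suppresses the normal end-of-cycle print.
trace=$(printf '%s\n' '"a aa"  MM  "bbb  b"' |
    sed -n -E -e ':redo' \
              -e 'p' \
              -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
              -e 't redo')

printf '%s\n' "$trace"
# prints:
# "a aa"  MM  "bbb  b"
# "a_aa"  MM  "bbb  b"
# "a_aa"  MM  "bbb_ b"
# "a_aa"  MM  "bbb__b"
```

Each successful substitution sends control back to the label; the fourth pass makes no substitution, so t falls through and the cycle ends.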


Given the discussion in the comments, there are a couple of points worth mentioning:

  1. The -E option applies to sed on Mac OS X (tested 10.7.2). The corresponding option for the GNU version of sed is -r (or --regexp-extended); recent GNU sed releases also accept -E. The -E option is consistent with grep -E (which also uses extended regular expressions). The 'classic Unix systems' do not support EREs with sed (Solaris 10, AIX 6, HP-UX 11).

  2. You can replace the ? I used (which is the only character that forces the use of an ERE instead of a BRE) with *, and then deal with the parentheses (which require backslashes in front of them in a BRE to make them into capturing parentheses), leaving the script:

    sed -e ':redo
            s/^\(\([^"]*\("[^ "]*"\)*\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
            t redo' f.txt
    

    This produces the same output on the same input - I tried some slightly more complex patterns in the input:

    "a aa"  MM  "bbb  b"
    MM    MM
    MM"b b "
    "c c""d d""e  e" X " f "" g "
     "C C" "D D" "E  E" x " F " " G "
    

    This gives the output:

    "a_aa"  MM  "bbb__b"
    MM    MM
    MM"b_b_"
    "c_c""d_d""e__e" X "_f_""_g_"
     "C_C" "D_D" "E__E" x "_F_" "_G_"
    
  3. Even with BRE notation, sed supports the \{0,1\} notation to specify 0 or 1 occurrences of the previous RE term, so the ? version can be translated to a BRE using:

    sed -e ':redo
            s/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
            t redo' f.txt
    

    This produces the same output as the other alternatives.
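The three points above can be checked together with a small script. This is a sketch under some assumptions: the probe for -E versus -r is one possible way to choose the flag, all variable names are hypothetical, and the g flag is dropped from the BRE forms because the pattern is anchored at ^ and can match at most once per pass.

```shell
# Pick an ERE flag by probing: -E first, then fall back to -r.
if printf 'x\n' | sed -E 's/(x)?/\1/' >/dev/null 2>&1; then
    ere_flag=-E
else
    ere_flag=-r
fi

# The ERE form and the two BRE translations from the list above.
ere='s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/'
bre_star='s/^\(\([^"]*\("[^ "]*"\)*\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/'
bre_interval='s/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/'

# One of the more complex test lines from above.
input='"c c""d d""e  e" X " f "" g "'

out_ere=$(printf '%s\n' "$input" | sed "$ere_flag" -e ':redo' -e "$ere" -e 't redo')
out_star=$(printf '%s\n' "$input" | sed -e ':redo' -e "$bre_star" -e 't redo')
out_interval=$(printf '%s\n' "$input" | sed -e ':redo' -e "$bre_interval" -e 't redo')

# All three lines printed here should be identical.
printf '%s\n' "$out_ere" "$out_star" "$out_interval"
```

On a classic sed with no ERE support at all, the probe would wrongly settle on -r, so only the two BRE results would be meaningful there.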

answered Nov 15 '22 15:11 by Jonathan Leffler