Why does sed not replace overlapping patterns

Tags:

shell

unix

sed

I have a database unload file with fields separated by the <TAB> character. I am running this file through sed to replace any occurrences of <TAB><TAB> with <TAB>\N<TAB>, so that when the file is loaded into MySQL the \N is interpreted as NULL.

The sed command 's/\t\t/\t\N\t/g;' almost works, except that it only replaces the first instance in a run of tabs, e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".

If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.

I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.

Could anyone explain what is happening and suggest a sed command that would work, or do I need to loop?

I know I could probably switch to awk, perl, python but I want to know what is happening in sed.

asked Sep 14 '11 18:09

hairyone

3 Answers

I know you want sed, but sed doesn't like this at all; it seems that it specifically won't do what you want (see here). However, perl will do it (AFAIK):

perl -pe 'while (s#\t\t#\t\\N\t#) {}' <filename>
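
To see what a single global pass in sed actually does, here is a small sketch. It assumes GNU sed (which understands \t in patterns); the function name one_pass is made up for illustration, and \\N is written in the replacement so that a literal backslash-N is emitted. With /g, sed scans left to right and resumes after the end of the previous match, so the middle tab of three is already used up by the first match and cannot start a second one.

```shell
# Hypothetical helper: one global pass, as in the question (GNU sed assumed).
one_pass() {
  sed 's/\t\t/\t\\N\t/g'
}

# Show tabs as '|' so the result is readable.
printf 'a\t\t\tb\n' | one_pass | tr '\t' '|'
# prints: a|\N||b   (the second pair of tabs is left untouched)
```

The leftover pair only becomes visible to sed on a second pass over the already substituted line, which is why running the same command twice "replaces more instances".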

answered Oct 17 '22 12:10

KevinDTimm

As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.

sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'

... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
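
A quick way to check the workaround, as a sketch assuming GNU sed (for \t inside the pattern and the bracket expression); the function name mark_then_prune is made up for illustration:

```shell
# Hypothetical wrapper around the two-step workaround (GNU sed assumed).
mark_then_prune() {
  # Step 1: append \N after every tab.
  # Step 2: drop each \N that is followed by a non-tab character.
  sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
}

# Show tabs as '|' so the result is readable.
printf 'a\t\t\tb\tc\n' | mark_then_prune | tr '\t' '|'
# prints: a|\N|\N|b|c
```

One behaviour worth knowing: a line that ends in a tab keeps its final \N, because there is no following character for the second expression to match, so 'a<TAB>' becomes 'a<TAB>\N'. Whether that is wanted depends on how the last field should be loaded.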

answered Oct 17 '22 12:10

tripleee

Not dissimilar to the perl solution, this works for me using pure sed

With @Robin A. Meade improvement

sed ':repeat;
     s|\t\t|\t\n\t|g;
     t repeat'

Explanation

:repeat is a label, used as the target of branch commands, similar to a label in a batch file
s|\t\t|\t\n\t|g; - Standard replace of 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
t repeat means: if the "s" command made any replacements, go to the label repeat; otherwise sed is done with this line, moves on to the next input line and starts the script over again.

In short: keep repeating (goto repeat) as long as the pattern of 2 tabs still matches.

While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.

As @thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
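
Note that the command above inserts a newline (\n) between the tabs, whereas the question asks for a literal \N. A minimal sketch of the same loop adapted to that goal, assuming GNU sed; the function name fill_nulls is made up for illustration, and \\N in the replacement produces a literal backslash followed by N:

```shell
# Hypothetical helper: the t-loop from above, emitting a literal \N (GNU sed assumed).
fill_nulls() {
  sed ':repeat; s|\t\t|\t\\N\t|g; t repeat'
}

# Four tabs in a row = three empty fields. Show tabs as '|'.
printf 'a\t\t\t\tb\n' | fill_nulls | tr '\t' '|'
# prints: a|\N|\N|\N|b
```

The first pass fills every other gap, the second pass fills the rest, and the third pass finds nothing to replace, so t does not branch and sed moves on to the next line.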

Original Answer

sed ':repeat;
     /\t\t/{
       s|\t\t|\t\n\t|g;
       b repeat
     }'

Explanation

:repeat is a label, used as the target of branch commands, similar to a label in a batch file
/\t\t/ means match the pattern of 2 tabs. If the pattern is matched, the command following the second / is executed.
{} - In this case the command following the match is a group, so all of the commands in the group are executed if the pattern matches.
s|\t\t|\t\n\t|g; - Standard replace of 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
b repeat means always goto (branch to) the label repeat

Short version

Which can be shortened to

sed ':r;s|\t\t|\t\n\t|g; t r'

# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'

MacOS

And the Mac (yet still Linux/Windows compatible) version:

sed $':r\ns|\t\t|\t\\\n\t|g; t r'

# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'

Tabs need to be literal in BSD sed
Newlines in the replacement need to be both literal and escaped at the same time, hence the single backslash (written as \\ before it is processed by the $'...' quoting, which turns it into a single literal backslash) plus the \n, which becomes an actual newline
Both label names (:r) and branch commands (b r, when not at the end of the expression) must end in a newline. Characters like semicolons and spaces are consumed as part of the label name/branch target in BSD sed, which makes it all very confusing.
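
An alternative that avoids $'...' quoting altogether, as a sketch: put a literal tab in a shell variable and pass each sed command as its own -e argument, so the label and the branch each end where their argument ends. The names tab and fill_nulls_portable are made up for illustration, and this variant emits the literal \N the question asks for rather than a newline. It should be accepted by both GNU and BSD sed, but treat the BSD part as an assumption to check on your own machine.

```shell
# Hypothetical portable variant: literal tab from printf, one -e per command.
tab=$(printf '\t')

fill_nulls_portable() {
  # Inside double quotes, \\\\N reaches sed as \\N, which sed emits as \N.
  sed -e ':r' -e "s/${tab}${tab}/${tab}\\\\N${tab}/g" -e 't r'
}

# Show tabs as '|' so the result is readable.
printf 'a\t\t\tb\n' | fill_nulls_portable | tr '\t' '|'
# prints: a|\N|\N|b
```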

answered Oct 17 '22 14:10

Andy
