I have a database unload file with field separated with the <TAB> character. I am running this file through sed to replace any occurences of <TAB><TAB> with <TAB>\N<TAB>. This is so that when the file is loaded into MySQL the \N in interpreted as NULL.
The sed command 's/\t\t/\t\N\t/g;' almost works except that it only replaces the first instance e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".
If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.
I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.
Could anyone explain what is happening and suggest a sed command that would work or do I need to loop.
I know I could probably switch to awk, perl, python but I want to know what is happening in sed.
By default sed does not overwrite the original file; it writes to stdout (hence the result can be redirected using the shell operator > as you showed).
Substitution command In some versions of sed, the expression must be preceded by -e to indicate that an expression follows. The s stands for substitute, while the g stands for global, which means that all matching occurrences in the line would be replaced.
I know you want sed, but sed doesn't like this at all, it seems that it specifically (see here) won't do what you want. However, perl will do it (AFAIK):
perl -pe 'while (s#\t\t#\t\n\t#) {}' <filename>
As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.
sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
Not dissimilar to the perl solution, this works for me using pure sed
With @Robin A. Meade improvement
sed ':repeat;
s|\t\t|\t\n\t|g;
t repeat'
:repeat
is a label, used for branch commands, similar to batchs|\t\t|\t\n\t|g;
- Standard replace 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.t repeat
means if the "s" command did any replaces, then goto the label repeat
, else it goes onto the next line and starts over again.So it goes like this. Keep repeating (goto repeat
) as long as there is a match for the pattern of 2 tabs.
While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.
As @thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
Original Answer
sed ':repeat;
/\t\t/{
s|\t\t|\t\n\t|g;
b repeat
}'
:repeat
is a label, used for branch commands, similar to batch/\t\t/
means match the pattern 2 tabs. If the pattern it matched, the command following the second / is executed.{}
- In this case the command following the match command is a group. So all of the commands in the group are executed if the match pattern is met.s|\t\t|\t\n\t|g;
- Standard replace 2 tabs with tab-newline-tab. I still use the global because if you have say 15 tabs, you will only need to loop twice, rather than 14 times.b repeat
means always goto (branch) the label repeat
Which can be shortened to
sed ':r;s|\t\t|\t\n\t|g; t r'
# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'
And the Mac (yet still Linux/Windows compatible) version:
sed $':r\ns|\t\t|\t\\\n\t|g; t r'
# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With