Split multiple fields per line to separate lines using sed, retaining line prefix

Question

Last Friday I ran into a problem: I had to transform a text file into another format. On that machine only GNU sed is available, no awk (strange, I know), and I know nothing about Perl, so I am looking for a sed-only solution.

The file content is:

a  yao.com sina.com
b  kongu.com
c  polm.com unee.net 21cn.com iop.com foo.com bar.com baz.net happy2all.com
d  kinge.net

The required output (which should go to a new file) is:

a  yao.com 
a  sina.com
b  kongu.com
c  polm.com 
c  unee.net 
c  21cn.com 
c  iop.com
c  foo.com
c  bar.com
c  baz.net
c  happy2all.com
d  kinge.net

I tried a lot and also searched the famous sed one-liners, but I cannot make it work. Can someone help me?

Chris Seymour · Accepted Answer

Interesting problem:

$ sed -r 's/(\w+\.\w+)/>  &/2g;:a s/^([a-z]+)(.*)>/\1\2\n\1/g;ta' file
a  yao.com 
a  sina.com
b  kongu.com
c  polm.com 
c  unee.net 
c  21cn.com 
c  iop.com 
c  foo.com 
c  bar.com 
c  baz.net 
c  happy2all.com
d  kinge.net

Edit:

It works by using two substitutions.

The first puts a > in front of every domain after the first one on a line, as a placeholder character:

$ sed -r 's/(\w+\.\w+)/>  &/2g' file
a  yao.com >  sina.com
b  kongu.com
c  polm.com >  unee.net >  21cn.com >  iop.com >  foo.com >  bar.com ...
d  kinge.net
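
The 2g at the end of that substitution is what leaves the first domain untouched. Here is a minimal sketch of the flag on its own, assuming GNU sed (combining a number with g is a GNU extension). The name keep_first_x is only an illustrative helper, not part of the answer:

```shell
# Assumes GNU sed: the "Ng" flag combination is a GNU extension.
# keep_first_x is a hypothetical helper name used only for this demo.
keep_first_x() {
    # Replace the 2nd and every later "x" with "y"; the 1st "x" is left alone.
    sed 's/x/y/2g'
}

echo 'x x x x' | keep_first_x
```

In the answer the same idea is applied to domain-shaped matches, so only the second and later domains get the > marker.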

The second replaces each placeholder > with a newline followed by the line prefix, looping with a conditional branch (ta) until no > is left:

$ sed -r ':a s/^([a-z]+)(.*)>/\1\2\n\1/g;ta'
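
Putting the two steps together, here is a sketch of the same approach with each part passed as its own -e expression, which avoids depending on how a particular sed parses a label followed by a command. It assumes GNU sed (\w, the 2g flag and \n in the replacement are GNU features) and uses -E, which GNU sed accepts as an equivalent of -r. The wrapper name split_fields is made up for illustration:

```shell
# Assumes GNU sed. split_fields is a hypothetical wrapper name.
# Expression 1: mark every domain after the first with ">".
# Expression 2: define the loop label "a".
# Expression 3: turn the last remaining ">" into newline + prefix.
# Expression 4: branch back to "a" if expression 3 substituted anything.
split_fields() {
    sed -E \
        -e 's/(\w+\.\w+)/>  &/2g' \
        -e ':a' \
        -e 's/^([a-z]+)(.*)>/\1\2\n\1/g' \
        -e 'ta'
}

printf '%s\n' 'a  yao.com sina.com' 'b  kongu.com' | split_fields
```

As in the output shown in the answer, the space that preceded each > stays at the end of the line before it, so every line except the last of a group ends with a trailing space.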

Kent · Answer

This is not an easy job for sed, particularly as a one-liner. However, you mentioned "gnu sed". I see the light!

gnu sed supports s/.../.../ge which is useful for this situation:

kent$  sed -r 's@(^[a-z]+) (.*)@echo "\2"\|sed "s# #\n\1  #g"\|sed "/^$/d"@ge' file  
a  yao.com
a  sina.com
b  kongu.com
c  polm.com
c  unee.net
c  21cn.com
c  iop.com
c  foo.com
c  bar.com
c  baz.net
c  happy2all.com
d  kinge.net

Short explanation:

1. The outer sed is sed -r 's@..x..@..y..@ge' file; the ge flags let us pass the matched parts to external commands.
2. The ..y.. part is where the magic of ge happens. I pass \2 to another sed (via echo): sed "s# #\n\1  #g". This sed replaces every space with a newline + \1 + two spaces.
3. The result of step 2 (the step above) contains empty lines, which we need to remove: "/^$/d".
4. Finally, the substitution in step 1 (the outer sed) is completed, and we get the result.

See info sed for details on s/../../ge.

Edit: added the double spaces, as the OP commented.
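
To see the e flag in isolation, here is a small sketch, assuming GNU sed (the e flag of s is a GNU extension). The substitution builds a shell command, and sed then replaces the pattern space with that command's output. The function name swap_words is invented for the demo:

```shell
# Assumes GNU sed: the "e" flag of the s command is a GNU extension.
# swap_words is a hypothetical name used only for this demo.
swap_words() {
    # The substitution rewrites the line into the command `echo <second> <first>`.
    # The e flag then runs that command through the shell and replaces the
    # pattern space with its output.
    sed -E 's/^(\w+) (\w+)$/echo \2 \1/e'
}

echo 'hello world' | swap_words
```

Because the matched text ends up on a shell command line, this technique should only be used on input you trust. The demo restricts both words to \w characters for that reason.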

Fredrik Pihl · Answer

As others have noted, a sed solution is tricky, so I thought I'd post a bash equivalent:

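#!/bin/bash

# Read each line into an array: element 0 is the prefix, the rest are fields.
while read -r -a array
do
    # Print the prefix once for every remaining field.
    for i in "${array[@]:1}"
    do
        echo "${array[0]} $i"
    done
done < input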

output:

a yao.com
a sina.com
b kongu.com
c polm.com
c unee.net
c 21cn.com
c iop.com
c foo.com
c bar.com
c baz.net
c happy2all.com
d kinge.net
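
Note that this output has a single space after the prefix, while the question asks for two. Here is a sketch of a variant that keeps the two-space separator, written for plain POSIX sh with read and printf instead of bash arrays (a swapped-in technique, not the answer's code). The name split_fields_sh is hypothetical:

```shell
# split_fields_sh is a hypothetical name; reads lines from standard input.
split_fields_sh() {
    # read puts the first word in "prefix" and the remainder in "rest".
    while read -r prefix rest; do
        # $rest is deliberately unquoted so the shell splits it into fields.
        # Caveat: unquoted expansion also globs, so this assumes the fields
        # contain no wildcard characters such as * or ?.
        for field in $rest; do
            printf '%s  %s\n' "$prefix" "$field"
        done
    done
}

printf '%s\n' 'a  yao.com sina.com' 'b  kongu.com' | split_fields_sh
```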

David Ravetti · Answer

Here is a true sed-only script that works. I've written it below as a file that is called by sed on the command line, but it could all be typed on the command line or all entered into a separate script as well:

Save the following as sedscript (or whatever you want to call it). Explanation follows the output.

:start
    h
    s/\(.\ \ [^ ]*\).*/\1/
    t continue
    d
:continue
    p
    x
    s/\(.\ \)\ [^ ]*\(\ .*\)/\1\2/
    t start
    d

Now run sed -f sedscript myfile.txt

With your example above saved as myfile.txt, the following is output:

a  yao.com
a  sina.com
b  kongu.com
c  polm.com
c  unee.net
c  21cn.com
c  iop.com
c  foo.com
c  bar.com
c  baz.net
c  happy2all.com
d  kinge.net

Sed has a pattern buffer (where you normally work with s/a/b/ kinds of commands) and a hold buffer. In this script, information is swapped back and forth to the hold buffer to retain the unedited part of a line while working on another part.

:start = label to enable jumping

h = copy the pattern buffer (current line) into the hold buffer

s/\(.\ \ [^ ]*\).*/\1/ = While the full line is safe in the hold buffer, strip everything after the first domain, leaving the first desired line (e.g. "a  yao.com").

t continue = if the previous command resulted in a substitution, jump to the "continue" label

d = if we didn't jump, that means we're done. Delete the pattern buffer and proceed to the next line of the file.

:continue = label for the previous jump

p = print out the pattern buffer (e.g. "a yao.com")

x = swap the pattern buffer with the hold buffer (could also use g to simply copy the hold buffer over the pattern buffer)

s/\(.\ \)\ [^ ]*\(\ .*\)/\1\2/ = The full original string has now been swapped into the pattern buffer - strip off the domain we just dealt with (e.g. "yao.com")

t start = if that wasn't the last domain, start the script over with the new, shortened string.

d = if that was the last domain, delete the pattern buffer and continue to the next line in the file.
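
The buffer juggling described above can be tried on its own with a much smaller script. This is only a sketch of the h / x mechanics, not the answer's script, and show_hold_swap is a made-up name:

```shell
# show_hold_swap is a hypothetical name used only for this demo.
show_hold_swap() {
    # -n turns off automatic printing, so only the explicit p commands print.
    #   h        copy the current line into the hold buffer
    #   s/.../   edit the pattern buffer (wrap the line in brackets)
    #   p        print the edited copy
    #   x        swap buffers, bringing the untouched original back
    #   p        print the original
    sed -n 'h;s/.*/[&]/;p;x;p'
}

printf '%s\n' one two | show_hold_swap
```

Each input line is printed twice: first the edited copy in brackets, then the original recovered from the hold buffer. The script above uses the same trick to print one prefix-domain pair and then return to the full line.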
