
Split multiple fields per line to separate lines using sed, retaining line prefix

Tags:

sed

Last Friday I had to transform a text file into another format. Only GNU sed is available on that machine, with no awk (strange, I know), and I know nothing about Perl, so I am looking for a sed-only solution.

The file content is:

a  yao.com sina.com
b  kongu.com
c  polm.com unee.net 21cn.com iop.com foo.com bar.com baz.net happy2all.com
d  kinge.net

The required output (which should go to a new file) is:

a  yao.com 
a  sina.com
b  kongu.com
c  polm.com 
c  unee.net 
c  21cn.com 
c  iop.com
c  foo.com
c  bar.com
c  baz.net
c  happy2all.com
d  kinge.net

I tried a lot and also searched the famous sed one-liners, but I could not make it work. Can someone help me?

Imagination asked Mar 16 '13 21:03


4 Answers

Interesting problem:

$ sed -r 's/(\w+\.\w+)/>  &/2g;:a s/^([a-z]+)(.*)>/\1\2\n\1/g;ta' file
a  yao.com 
a  sina.com
b  kongu.com
c  polm.com 
c  unee.net 
c  21cn.com 
c  iop.com 
c  foo.com 
c  bar.com 
c  baz.net 
c  happy2all.com
d  kinge.net

Edit:

It works by using two substitutions.

The first puts a > before the URLs that need flattening as a holding character:

$ sed -r 's/(\w+\.\w+)/>  &/2g' file
a  yao.com >  sina.com
b  kongu.com
c  polm.com >  unee.net >  21cn.com >  iop.com >  foo.com >  bar.com ...
d  kinge.net

The second replaces each holding > with a newline followed by the line prefix, looping with a conditional branch (t) until no > is left:

$ sed -r ':a s/^([a-z]+)(.*)>/\1\2\n\1/g;ta'
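
To watch the two stages separately, here is a minimal sketch that runs them on a single sample line. It assumes GNU sed (for -r, \w, the combined 2g flag, and \n in the replacement). The function names mark_fields and flatten_marks are made up for illustration, and the loop is split into -e expressions so the label does not depend on how a given sed parses ":a s/...".

```shell
# Stage 1: put a ">" marker before the 2nd and every later domain.
mark_fields() {
    sed -r 's/(\w+\.\w+)/>  &/2g'
}

# Stage 2: replace the last remaining ">" with a newline plus the line
# prefix, and branch back to :a for as long as a substitution was made.
flatten_marks() {
    sed -r -e ':a' -e 's/^([a-z]+)(.*)>/\1\2\n\1/g' -e 'ta'
}

printf 'a  yao.com sina.com\n' | mark_fields
printf 'a  yao.com sina.com\n' | mark_fields | flatten_marks
```

Because (.*) is greedy, each pass consumes the last marker on the line, so the line is split from the right. The space that preceded each marker stays behind, which is why the output lines above end in a trailing space.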

Chris Seymour answered Nov 02 '22 00:11


This is not an easy job for sed, particularly as a one-liner. However, you mentioned "gnu sed". I see the light!

GNU sed supports s/.../.../ge, which is useful in this situation:

kent$  sed -r 's@(^[a-z]+) (.*)@echo "\2"\|sed "s# #\\n\1  #g"\|sed "/^$/d"@ge' file  
a  yao.com
a  sina.com
b  kongu.com
c  polm.com
c  unee.net
c  21cn.com
c  iop.com
c  foo.com
c  bar.com
c  baz.net
c  happy2all.com
d  kinge.net

Short explanation:

  1. The outer sed is sed -r 's@..x..@..y..@ge' file. With the e flag, GNU sed executes the result of the substitution as a shell command and replaces the pattern space with that command's output, so the matched parts can be passed to external commands.
  2. The ..y.. part is where ge does its work. I pass \2 to another sed (via echo): sed "s# #\\n\1  #g". This sed replaces every space with \n + \1 + two spaces.
  3. With the two-space input, \2 starts with a space, so the output of step 2 begins with an empty line. A third sed, "/^$/d", removes such empty lines.
  4. Finally, the output of that pipeline replaces the line in the outer sed (step 1), and we get the result.
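
As a smaller illustration of the e flag on its own, the sketch below builds a shell command from each input line and lets GNU sed run it. The function name shout is hypothetical, and the e flag is a GNU extension, so this is not expected to work with other sed implementations.

```shell
# After the substitution, the pattern space holds: echo "hello" | tr "a-z" "A-Z"
# The e flag runs that text as a shell command and replaces the pattern
# space with the command's output.
shout() {
    sed 's/.*/echo "&" | tr "a-z" "A-Z"/e'
}

printf 'hello\n' | shout
```

Note that the input line is pasted into a shell command verbatim, so this approach is only reasonable for input you trust.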

See info sed for details on s/../../ge.

Edit: added the double spaces, as the OP requested in a comment.

Kent answered Nov 02 '22 01:11


As others have noted, a sed solution is tricky, so I thought I would post a bash equivalent:

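#!/bin/bash

# Read each line into an array: element 0 is the prefix, the rest are fields.
while read -r -a array
do
    # Emit one "prefix field" line per remaining element.
    for i in "${array[@]:1}"
    do
        echo "${array[0]} $i"
    done
done < input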

output:

a yao.com
a sina.com
b kongu.com
c polm.com
c unee.net
c 21cn.com
c iop.com
c foo.com
c bar.com
c baz.net
c happy2all.com
d kinge.net
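
The loop above prints a single space between prefix and field, while the question's output uses two. A minimal variant that keeps the two-space separator and avoids arrays, so it should also run under a plain POSIX sh, might look like this. The function name split_fields_sh is made up for illustration, and it reads from standard input rather than from a file named input.

```shell
# The function body is a subshell, so "set -f" does not leak to the caller.
split_fields_sh() (
    set -f    # disable globbing, since $rest is deliberately word-split below
    # read puts the first word in prefix and the remainder of the line in rest.
    while read -r prefix rest; do
        for field in $rest; do
            printf '%s  %s\n' "$prefix" "$field"
        done
    done
)

printf 'a  yao.com sina.com\nb  kongu.com\n' | split_fields_sh
```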

Fredrik Pihl answered Nov 01 '22 23:11


Here is a true sed-only script that works. I've written it below as a file that is called by sed on the command line, but it could all be typed on the command line or all entered into a separate script as well:

Save the following as sedscript (or whatever you want to call it). Explanation follows the output.

:start
    h
    s/\(.\ \ [^ ]*\).*/\1/
    t continue
    d
:continue
    p
    x
    s/\(.\ \)\ [^ ]*\(\ .*\)/\1\2/
    t start
    d

Now run sed -f sedscript myfile.txt

With your example above saved as myfile.txt, the following is output:

a  yao.com
a  sina.com
b  kongu.com
c  polm.com
c  unee.net
c  21cn.com
c  iop.com
c  foo.com
c  bar.com
c  baz.net
c  happy2all.com
d  kinge.net

Sed has a pattern buffer (where you normally work with s/a/b/ kinds of commands) and a hold buffer. In this script, information is swapped back and forth to the hold buffer to retain the unedited part of a line while working on another part.

:start = label to enable jumping

h = copy the pattern buffer (current line) into the hold buffer

s/\(.\ \ [^ ]*\).*/\1/ = While the full line is safe in the hold buffer, strip everything after the first domain, leaving the first desired line (e.g. "a yao.com").

t continue = if the previous command resulted in a substitution, jump to the "continue" label

d = if we didn't jump, the line has no "prefix, two spaces, domain" to extract (a blank line, for example). Delete the pattern buffer and proceed to the next line of the file.

:continue = label for the previous jump

p = print out the pattern buffer (e.g. "a yao.com")

x = swap the pattern buffer with the hold buffer (could also use g to simply copy the hold buffer over the pattern buffer)

s/\(.\ \)\ [^ ]*\(\ .*\)/\1\2/ = The full original string has now been swapped into the pattern buffer - strip off the domain we just dealt with (e.g. "yao.com")

t start = if that wasn't the last domain, start the script over with the new, shortened string.

d = if that was the last domain, delete the pattern buffer and continue to the next line in the file.
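
For the command-line form mentioned at the top of this answer, one way to pass the same commands inline is a series of -e expressions, sketched here. The wrapper name split_fields is hypothetical, and the escaped spaces of the script are written as plain spaces, which match the same text.

```shell
# Same commands as the script file, one -e expression per command, so each
# label ends cleanly at the end of its own expression.
split_fields() {
    sed -e ':start' -e 'h' \
        -e 's/\(.  [^ ]*\).*/\1/' -e 't continue' -e 'd' \
        -e ':continue' -e 'p' -e 'x' \
        -e 's/\(. \) [^ ]*\( .*\)/\1\2/' -e 't start' -e 'd' "$@"
}

printf 'a  yao.com sina.com\nb  kongu.com\n' | split_fields
```

With no arguments the function reads standard input. Given a file name, for example split_fields myfile.txt, it behaves like sed -f sedscript myfile.txt.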

David Ravetti answered Nov 02 '22 01:11