I am not sure if I can do this purely with sed:
I am trying to rearrange lines like this
GF:001,GF:00012,GF:01223<TAB>XXR
GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3
to
GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3
Does anyone have any hints? The number of GF:XXXX entries per line varies, and so does the length of each entry.
I am stuck with sed -n '
'/\(XX.*\)$/' {
s/,/\t\1\n/
}' input
but I cannot refer back to the pattern matched in the address from inside the substitution. Any ideas? Cheers!
Update: I don't think this is possible using sed alone, so I used Perl to do it:
perl -e 'open(IN, "< file") or die "cannot open file: $!";
while (<IN>) {
@a = split(/\t/);
@gos = split(/,/, $a[0]);
foreach (@gos) {
print $_."\t".$a[1];
}
}
close( IN );' > output
But if anyone knows a way to solve this with sed alone, please post it here...
It can be done in sed, though I would probably use Perl (or Awk or Python) to do it.
I claim no elegance for this solution, but brute force and ignorance sometimes pay off. I created a file called, unoriginally, sed.script containing:
/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}
I ran it as:
sed -f sed.script input
where input contained the two lines shown in the question. It produced the output:
GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3
(I took the liberty of deliberately misinterpreting <TAB> to be a 5-character sequence instead of a single tab character; you can easily fix the answer to handle an actual tab character instead.)
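Explanation of the sed script:
Find lines containing at least two occurrences of GF:nnn separated by commas (we do not need to process lines that contain a single such occurrence). Do the rest of the script only on such lines. Anything else is passed through (printed) unchanged.
Create the label redo so the script can branch back to it.
Split the line into three remembered parts: the first GF:nnn, the rest of the comma-separated list, and the text after <TAB>. Replace this with the first field, <TAB>, third field, implausible marker pattern (@@@@@), second field, <TAB>, third field.
Copy the modified line to the hold space (h).
Delete everything from the marker pattern to the end of the line, then print what remains (p).
Swap the hold space back into the pattern space (x) and remove everything up to and including the marker pattern.
If a substitution was made, branch back to the redo label.
Otherwise, delete what is left (it has already been printed) and end the block.
This is a simple loop that reduces the number of the patterns by one on each iteration.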
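For input that contains real tab characters rather than the literal <TAB> placeholder, the same script might be adapted along these lines. This is only a sketch: the variable tab and the function name split_with_marker are introduced here for illustration and are not part of the original answer, and it assumes the marker @@@@@ never occurs in the data.

```shell
#!/bin/sh
# Sketch: the marker-based script with a real tab in place of <TAB>.
# "tab" and "split_with_marker" are hypothetical names for this example.
tab=$(printf '\t')

split_with_marker() {
    # Double quotes let ${tab} expand; \( \) and \1 reach sed unchanged.
    sed "/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}"
}

# Example: the first line from the question, with a real tab.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_with_marker
```

Lines with a single GF entry do not match the address and are printed unchanged by sed's default output.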
You can do it straightforwardly with awk:
$ awk '{gsub(/,/, "\t" $NF "\n");print}' input
In this case, we just replace each comma with a tab concatenated with the last field (NF stores the number of fields of a record; $NF gets the NFth field) concatenated with a newline. Then we print the result.
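The one-liner relies on awk's default whitespace field splitting, so $NF is the last whitespace-separated word, and an & in that field would be treated specially in the gsub replacement. If either could be an issue, a more explicit variant is sketched below; it assumes exactly two tab-separated columns, and the function name split_with_awk and the array name ids are made up for this example.

```shell
#!/bin/sh
# Sketch: explicit-loop awk variant, assuming two tab-separated columns.
# "split_with_awk" and "ids" are hypothetical names for this example.
split_with_awk() {
    awk -F '\t' '{
        n = split($1, ids, ",")      # ids[1..n] holds the GF entries
        for (i = 1; i <= n; i++)
            print ids[i] "\t" $2     # pair each entry with the second column
    }'
}

# Example: a shortened version of the second line from the question.
printf 'GF:001,GF:0666\tXXXR3\n' | split_with_awk
```

The loop is longer than the gsub version, but it never reinterprets the second column, so its contents are copied through as they are.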
It can be solved with sed, too, in a way similar to, but IMHO a bit better than, Jonathan's solution (which is pretty sophisticated, I should remark).
sed -n '
:BEGIN
h
s/,.*<TAB>/<TAB>/
p
x
s/^[^,]*,//
t BEGIN' input
Here, we define a label in the beginning of the script:
:BEGIN
Then we copy the content of the pattern space to the hold space:
h
Now, we replace everything from the first comma until the tab with only a tab:
s/,.*<TAB>/<TAB>/
We print the result...
p
...and retrieve the content of the hold space:
x
Since we have printed the first output line (the first GF:XXX pattern followed by the final XXR pattern), we remove the first GF:XXX pattern from the line:
s/^[^,]*,//
If a replacement was made, we branch to the beginning of the script:
t BEGIN
And everything is applied again to the same line, except that now this line does not have the first GF:XXX pattern anymore. OTOH, if no replacement is made, then the processing of the current line is done and we do not jump to the beginning anymore.
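Put together with a real tab character, the walk-through above could look roughly like this. It is a sketch under the assumption that each line has a single tab separating the GF list from the label; the variable tab and the function name split_with_hold are introduced only for this example.

```shell
#!/bin/sh
# Sketch: the hold-space loop with a real tab in place of <TAB>.
# "tab" and "split_with_hold" are hypothetical names for this example.
tab=$(printf '\t')

split_with_hold() {
    sed -n "
:BEGIN
# keep a copy of the whole line in the hold space
h
# collapse everything from the first comma up to the tab into one tab
s/,.*${tab}/${tab}/
p
# bring the saved line back and drop its first entry
x
s/^[^,]*,//
# loop again only if an entry was dropped
t BEGIN
"
}

# Example: both lines from the question, with real tabs.
printf 'GF:001,GF:00012,GF:01223\tXXR\nGF:001,GF:00012,GF:01223,GF:0666\tXXXR3\n' | split_with_hold
```

On the last pass there is no comma left, so neither substitution succeeds, the remaining entry is printed once, and sed moves on to the next input line.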