Extract multiple independent regex matches per line

Question

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:

1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"

The original file has a few more columns and millions more rows than the following example, but this should give you the idea:

    MOUSE_10        XC:Z:TGGTCGGCGCGT       RG:Z:A  XM:Z:GAGTCCGT   ZP:i:33
    MOUSE_10        XC:Z:GAAGCCGCTTCC       NM:i:0  XM:Z:ACCGACGG   AS:i:16
    MOUSE_10        ZP:i:36 XC:Z:TCCCCGGGTACA       NM:i:0  XM:Z:GGGACGGG   ZP:i:28
    MOUSE_10        XC:Z:CAAATTTGGAAA       RG:Z:A  NM:i:1  XM:Z:GCAGATAG

In addition, each of the following criteria would be a bonus, but none is mandatory if you can get it to work:

use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (I'm fairly certain XC almost always comes first, but I am unsure how to check). In the output, however, the XC string should always come before the XM string, if at all possible.

The answers to "awk extract multiple groups from each line" come awfully close, but whenever I try using match(...) I get a "syntax error near unexpected token" message.

Looking forward to your solutions!

Thanks,

Felix

SLePort · Accepted Answer

With sed you can capture non-space characters after XC:Z: and XM:Z:

    sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p' file

You can add a second s command for lines where XM:Z: comes before XC:Z:, swapping the back-references so the XC string is still printed first:

    sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\2, \1/p' file
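Since the question also mentions awk and not knowing the tag order, here is a minimal sketch of the same idea in plain POSIX awk (no gawk-only match() with an array). It is not part of the answer above. It assumes the fields are separated by tabs or spaces and that each tag appears at most once per line; the function name extract_tags and the sample rows fed to it are only for illustration.

```shell
# extract_tags (hypothetical helper name): reads lines on stdin and prints
# "XC, XM" for every line that carries both tags, whatever their order.
extract_tags() {
  awk '{
    xc = ""; xm = ""
    for (i = 1; i <= NF; i++) {
      # index(...) == 1 means the field starts with the tag prefix;
      # the prefix is 5 characters long, so the value starts at position 6.
      if (index($i, "XC:Z:") == 1)      xc = substr($i, 6)
      else if (index($i, "XM:Z:") == 1) xm = substr($i, 6)
    }
    # Lines missing either tag are skipped.
    if (xc != "" && xm != "") print xc ", " xm
  }'
}

# First row from the question, plus a made-up row with the tags reversed.
printf 'MOUSE_10\tXC:Z:TGGTCGGCGCGT\tRG:Z:A\tXM:Z:GAGTCCGT\tZP:i:33\n' | extract_tags
printf 'MOUSE_10\tXM:Z:ACCGACGG\tNM:i:0\tXC:Z:GAAGCCGCTTCC\n' | extract_tags
```

On a real file this would be called as `extract_tags < file`. Because it checks each field separately instead of matching the whole line with one regex, the order of XC and XM in the input does not matter.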
