Gawk regexp to select sequence

Question

Sorry for the nth simple question on regexps, but I can't get what I need without what seems to me an overly complicated solution. I'm parsing a file containing sequences made of only three letters, A, E, and D, such as

AADDEEDDA

EEEEEEEE

AEEEDEEA

AEEEDDAAA

and I'd like to identify only those that start with E and end with D, with exactly one change of letter in the sequence, for example

EDDDDDDDD

EEEDDDDDD

EEEEEEEED

I'm struggling to find the right regexp for this. Here is my latest attempt:

echo "1,AAEDDEED,1
2,EEEEDDDD,2
3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'

which does not work. Any help?

Thanks in advance.

Giuseppe Ricupero · Accepted Answer

If I understand your request correctly, a simple

awk '/^E+D+$/' file.input

will do the trick.
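As a quick check, here is a minimal sketch that runs this pattern over sequences taken from the question. The variable names `sequences` and `matches` are only illustrative, and the input is fed through `printf` instead of a file:

```shell
# Sample sequences from the question, one per line (illustrative input).
sequences='AADDEEDDA
EEEEEEEE
AEEEDEEA
EDDDDDDDD
EEEDDDDDD
EEEEEEEED
EDEDEDED'

# Keep only lines made of one or more E followed by one or more D,
# i.e. start with E, end with D, and change letter exactly once.
matches=$(printf '%s\n' "$sequences" | awk '/^E+D+$/')
printf '%s\n' "$matches"
```

Only EDDDDDDDD, EEEDDDDDD and EEEEEEEED should be printed. As for the attempt in the question: inside a bracket expression the characters `(`, `)`, `{` and `}` are literal, so `[(ED){1,1}]*` is just a character class and does not express "the group ED repeated once", which is probably why that command selects nothing.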

UPDATE: if each line also contains leading and trailing numbers (the trailing one being optional), as shown in the example in the question, a possible pure-regex adaptation (an alternative to using the field-separator switch -F,) is:

awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
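The two approaches can be compared side by side on the sample records from the question. This is only a sketch: the `-F,` variant is an assumed spelling of the field-based alternative mentioned above, and `records`, `by_field` and `by_line` are illustrative names:

```shell
# Sample records from the question: number,sequence[,number]
records='1,AAEDDEED,1
2,EEEEDDDD,2
3,EDEDEDED'

# Variant 1: split on commas and test only the second field.
by_field=$(printf '%s\n' "$records" | awk -F, '$2 ~ /^E+D+$/')

# Variant 2: match the whole line; the trailing ",number" is optional.
by_line=$(printf '%s\n' "$records" | awk '/^[0-9]+,E+D+(,[0-9]+)?$/')

printf '%s\n' "$by_field" "$by_line"
```

Both variants should select only the record `2,EEEEDDDD,2`. The field-based form ignores what surrounds the sequence, while the whole-line form also checks that the other fields are numeric.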

Tags:

regex

gawk

G. Tartifola

1 Answers

Giuseppe Ricupero
