 

Gawk regexp to select sequence

Tags: regex, gawk

Sorry for the nth simple question on regexps, but I can't get what I need without what seems to me an overly complicated solution. I'm parsing a file containing sequences made of only three letters (A, E, D), as in

AADDEEDDA

EEEEEEEE

AEEEDEEA

AEEEDDAAA

and I'd like to select only those that start with E and end with D, with only one change in the sequence, as for example in

EDDDDDDDD

EEEDDDDDD

EEEEEEEED

I'm struggling to find the proper regexp for that. Here is my last attempt:

echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'

which does not work. Any help?

Thanks in advance.

G. Tartifola asked Feb 09 '23 14:02


1 Answer

If I understand your request correctly, a simple

awk '/^E+D+$/' file.input

will do the trick.
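As a quick check, here is a minimal sketch that runs the same pattern against the sequences quoted in the question, assuming one sequence per line. The function name select_ed is hypothetical and only wraps the one-liner above. E+ requires at least one leading E and D+ at least one trailing D, so a line that is all E's, or one that switches back and forth, is rejected.

```shell
# select_ed is a hypothetical helper name; it only wraps the one-liner above.
select_ed() {
  awk '/^E+D+$/' "$@"
}

# Sequences taken from the question, one per line.
printf '%s\n' AADDEEDDA EEEEEEEE AEEEDEEA AEEEDDAAA \
              EDDDDDDDD EEEDDDDDD EEEEEEEED | select_ed
```

With this input, only EDDDDDDDD, EEEDDDDDD and EEEEEEEED should be printed.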

UPDATE: if the lines contain leading and trailing numbers (the trailing one being optional), as shown in the example in the question, this is a possible pure-regex adaptation (an alternative to using the field separator option -F,):

awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
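For comparison, the field-separator alternative could look like the sketch below, which keeps -F, from the question and tests only the second field. The name filter_field2 is hypothetical, and the input is fed with printf rather than echo because the handling of \n by echo differs between shells. Note also that in the original attempt the brackets in [(ED){1,1}] form a bracket expression, i.e. a set of single characters, not a group, which is one reason it does not behave as intended.

```shell
# filter_field2 is a hypothetical helper name, not part of the original answer.
# It splits each line on commas and matches the pattern against field 2 only.
filter_field2() {
  awk -F, '$2 ~ /^E+D+$/' "$@"
}

# Sample lines from the question; the third one has no trailing number.
printf '%s\n' '1,AAEDDEED,1' '2,EEEEDDDD,2' '3,EDEDEDED' | filter_field2
```

Of the three sample lines, only 2,EEEEDDDD,2 should be printed.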
Giuseppe Ricupero answered Feb 15 '23 09:02