Java Regex for genome puzzle

Tags:

java

regex

I was assigned a problem to find genes in a string made up of the letters A, C, G, and T all in a row, like ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA. A gene starts with ATG and ends with TAA, TAG, or TGA (the gene excludes both endpoints). The gene consists of triplets of letters, so its length is a multiple of three, and none of those triplets can be the start/end triplets listed above. So, for the string above, the genes in it are CTCTCT and CACACACACACA, and my regex does work for that particular string. Here's what I have so far (and I'm pretty happy with myself that I got this far):

(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)

However, if there is an ATG and end-triplet within another result, and not aligned with the triplets of that result, it fails. For example:

Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG :
TTGCTTATTGTTTTGAATGGGGTAGGA
ACCTGC

It should also find GGG but doesn't: TTGCTTATTGTTTTGA(ATG|GGG|TAG)GA

I'm new to regex in general and a little stuck...just a little hint would be awesome!

Swordbeard asked Sep 10 '10 13:09



2 Answers

The problem is that the regular expression consumes the characters that it matches and then they are not used again.

One way to solve this is to use a zero-width match (a lookaround). The match itself is then empty, so you only get the index of the match, not the characters that matched, unless you also put a capturing group inside the lookahead.
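As a rough sketch of the zero-width idea in Java: the whole gene test goes inside a lookahead, so nothing is consumed and a gene nested inside another one can still be found. The class name `OverlappingGeneRegex` and the method `findGenes` are made up for illustration, and the pattern is an adaptation of the one in the question (it rejects start and stop triplets with a negative lookahead instead of a lazy quantifier), not the asker's exact regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class OverlappingGeneRegex {
    // (?<=ATG)  the gene body must come right after a start triplet
    // (?=...)   everything else sits in a lookahead, so the match is empty
    // group 1   one or more triplets, none of which is ATG, TAA, TAG or TGA
    // finally   the body must be followed by a stop triplet
    private static final Pattern GENE = Pattern.compile(
        "(?<=ATG)(?=((?:(?!ATG|TAA|TAG|TGA)[ACGT]{3})+)(?:TAA|TAG|TGA))");

    static List<String> findGenes(String dna) {
        List<String> genes = new ArrayList<>();
        Matcher m = GENE.matcher(dna);
        // After an empty match, find() resumes one character further on,
        // so every position in the input gets tried.
        while (m.find()) {
            genes.add(m.group(1)); // the gene text captured inside the lookahead
        }
        return genes;
    }

    public static void main(String[] args) {
        System.out.println(findGenes(
            "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG"));
        // prints [TTGCTTATTGTTTTGAATGGGGTAGGA, GGG, ACCTGC]
    }
}
```

Because the match is empty, `m.group()` would return an empty string here; only `m.group(1)` carries the gene.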

Alternatively you can use three similar regular expressions, but each using a different offset:

(?=(.{3})+$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+.$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+..$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
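For what it's worth, here is one way the three expressions above might be driven from Java. Each pattern only matches at positions in one reading frame, so running all three and merging the results by start index should give every gene in left-to-right order. The class name `FramedGeneRegex` and the `TreeMap` merge step are my own assumptions; only the three pattern strings come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class for illustration only.
class FramedGeneRegex {
    // The regex from the question, unchanged.
    private static final String CORE =
        "(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)";

    // One pattern per offset: the leading lookahead fixes how many
    // characters remain after the match position, modulo three.
    private static final Pattern[] FRAMES = {
        Pattern.compile("(?=(.{3})+$)" + CORE),
        Pattern.compile("(?=(.{3})+.$)" + CORE),
        Pattern.compile("(?=(.{3})+..$)" + CORE),
    };

    static List<String> findGenes(String dna) {
        // Keyed by start index so genes from different frames end up
        // in the order they appear in the input.
        Map<Integer, String> byStart = new TreeMap<>();
        for (Pattern p : FRAMES) {
            Matcher m = p.matcher(dna);
            while (m.find()) {
                // The lookarounds are zero-width, so group() is the gene body.
                byStart.put(m.start(), m.group());
            }
        }
        return new ArrayList<>(byStart.values());
    }
}
```

Calling `FramedGeneRegex.findGenes` on the second string from the question should now also return the GGG that the single regex missed.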

You might also want to consider a different approach that doesn't involve regular expressions, as the regular expressions above would be slow: the leading lookahead has to scan to the end of the input at every position it is tried.

Mark Byers answered Sep 29 '22 02:09



The problem with things like this is that you can slowly build up a regex, rule by rule, until you have something that works.

Then your requirements change and you have to start all over again, because it's nearly impossible for mere mortals to reverse engineer a complex regex.

Personally, I'd rather do it the 'old fashioned' way - use string manipulation. Each stage can be easily commented, and if there's a slight change in the requirements you can just tweak a particular stage.
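To give an idea of what that could look like, here is a minimal sketch using only `indexOf` and `substring`. The names `GeneScanner`, `findGenes` and `isStop` are invented for this example, and it assumes that a start triplet followed immediately by a stop triplet is not reported as an (empty) gene, which the question doesn't spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only.
class GeneScanner {
    static boolean isStop(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    static List<String> findGenes(String dna) {
        List<String> genes = new ArrayList<>();
        // Stage 1: visit every ATG, whatever its alignment.
        int start = dna.indexOf("ATG");
        while (start >= 0) {
            int bodyStart = start + 3;
            // Stage 2: read triplets from just after the ATG.
            for (int i = bodyStart; i + 3 <= dna.length(); i += 3) {
                String codon = dna.substring(i, i + 3);
                if (codon.equals("ATG")) {
                    break; // a start triplet inside the body rules this one out
                }
                if (isStop(codon)) {
                    // Stage 3: keep the body, skipping empty genes (assumption).
                    if (i > bodyStart) {
                        genes.add(dna.substring(bodyStart, i));
                    }
                    break;
                }
            }
            // Resume one character later so overlapping ATGs are not skipped.
            start = dna.indexOf("ATG", start + 1);
        }
        return genes;
    }

    public static void main(String[] args) {
        System.out.println(findGenes(
            "ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA"));
        // prints [CTCTCT, CACACACACACA]
    }
}
```

If the rules change, say a new stop triplet is added, only `isStop` needs touching.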

PaulJWilliams answered Sep 29 '22 01:09
