On reading this question, I thought the following problem would be simple using StringSplit
Given the following string, I want to 'cut' it to the left of every "D" such that:
I get a List of fragments (with sequence unchanged)
StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, the sequence within each fragment is important, and I do not want to lose any characters.
(The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain the theoretical list of ALL fragments obtained by treating with an enzyme known to split before "D")
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
The best I can come up with is to insert a space before each "D" using StringReplace and then use StringSplit. This seems quite awkward, to say the least.
frags1 = StringSplit@StringReplace[str, "D" -> " D"]
giving as output:
{"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
or, alternatively, using StringReplacePart:
frags1alt =
StringSplit@StringReplacePart[str, " D", StringPosition[str, "D"]]
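As a quick sanity check on the requirement that no characters are lost, something along these lines should give True for both tests (a sketch only, reusing str and frags1 as defined above):

```mathematica
(* rejoin the fragments and compare with the original string *)
StringJoin[frags1] === str

(* the fragment lengths should also add up to the length of str *)
Total[StringLength /@ frags1] == StringLength[str]
```

The same two tests can be applied to any of the other fragment lists by substituting its name for frags1.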
Finally (and more realistically), if I want to split before "D" provided that the residue immediately preceding it is not "P" [i.e. P-D (Pro-Asp) bonds are not cleaved], I do it as follows:
StringSplit@StringReplace[str, (x_ /; x != "P") ~~ "D" -> x ~~ " D"]
Is there a more elegant way?
Speed is not necessarily an issue. I am unlikely to be dealing with strings longer than, say, 500 characters. I am using Mma 7.
Update
I have added the bioinformatics tag, and I thought it might be of interest to add an example from that field.
The following imports a protein sequence (Bovine serum albumin, accession number 3336842) from the NCBI database using eutils and then generates a (theoretical) trypsin digest. I have assumed that the enzyme trypsin cleaves between residues A1-A2 when A1 is either "R" or "K", provided that A2 is not "R", "K" or "P". Suggestions for improvements are welcome.
Using a modification of sakra's method (a carriage return after '?db=' possibly needs to be removed):
StringJoin /@
Split[Characters[#],
And @@ Function[x, #1 != x] /@ {"R", "K"} ||
Or @@ Function[xx, #2 == xx] /@ {"R", "K", "P"} &] & @
StringJoin@
Rest@Import[
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=\
protein&id=3336842&rettype=fasta&retmode=text", "Data"]
My possibly ham-fisted attempt at using the regex method (Sasha/WReach) to do the same thing:
StringSplit[#, RegularExpression["(?![PKR])(?<=[KR])"]] &@
StringJoin@Rest@Import[...]
Output
{MK,WVTFISLLLLFSSAYSR,GVFRR,<<69>>,CCAADDK,EACFAVEGPK,LVVSTQTALA}
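To try the cleavage rule without a network call, the same regular expression can be applied to a short made-up sequence. The peptide below is hypothetical and chosen only to exercise the K/R and P cases; it is a sketch, not part of the original digest:

```mathematica
(* hypothetical toy sequence, not a real protein *)
toy = "AKRGKPTRLM";

(* zero-width split: preceded by K or R, and not followed by P, K or R *)
StringSplit[toy, RegularExpression["(?![PKR])(?<=[KR])"]]
(* expected: {"AKR", "GKPTR", "LM"} *)
```

The K-R and K-P bonds are left intact, while R-G and R-L are cut, which matches the rule assumed above.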
I cannot build anything much simpler than your code. Here is a regex version, which you might happen to like:
In[281]:= StringSplit@
StringReplace[str, RegularExpression["(?<!P)D"] -> " D"]
Out[281]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}
It uses a negative lookbehind pattern, borrowed from this site.
In[2]:= StringSplit[str, RegularExpression["(?<!P)(?=D)"]]
Out[2]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}
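The lookbehind/lookahead pair matches a zero-width position, so StringSplit cuts there without consuming any characters and no spaces need to be inserted first. As a sketch, the same idea could be wrapped in a small helper. The name cleaveBefore and its argument names are made up for illustration, and the helper assumes both arguments are single letters with no special meaning in a regular expression:

```mathematica
(* hypothetical helper: cut before `residue` unless it is preceded by `blocker` *)
cleaveBefore[seq_String, residue_String, blocker_String] :=
  StringSplit[seq,
    RegularExpression["(?<!" <> blocker <> ")(?=" <> residue <> ")"]]

(* with str as defined in the question, this should reproduce the output above *)
cleaveBefore[str, "D", "P"]
```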
Here are some alternate solutions:
Splitting by any occurrence of "D":
In[18]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" &]
Out[18]= {"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
Splitting by any occurrence of "D" provided it is not preceded by "P":
In[19]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" || #1=="P" &]
Out[19]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
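Split keeps adjacent elements in the same run for as long as the test applied to each neighbouring pair gives True, so a new fragment starts wherever the test fails. A minimal illustration on a made-up five-letter string:

```mathematica
(* #1 is the left neighbour, #2 the right one; the run breaks when #2 is "D" *)
Split[Characters["ABDCD"], #2 != "D" &]
(* expected: {{"A", "B"}, {"D", "C"}, {"D"}} *)
```

Adding || #1 == "P" to the test makes it True again for a P-D pair, which is why that bond stays inside one fragment.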
Your first solution isn't that bad, is it? Everything that I can think of is longer or uglier than that. Is the problem that there might be spaces in the original string?
StringCases[str, "D" | StartOfString ~~ Longest[Except["D"] ..]]
or
Prepend["D" <> # & /@ Rest[StringSplit[str, "D"]], First[StringSplit[str, "D"]]]