I have some "tokenized" templates, for example (I call tokens the part between double braces):
var template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";
I want to extract an array from this sentence, in order to have something like:
Array("{{TOKEN1}}",
" is a ",
"{{TOKEN2}}",
" and it has some ",
"{{TOKEN3}}");
I've tried to achieve that with the following Regex code:
Regex r = new Regex(@"({{[^\}]*}})");
var n = r.Split(template1);
And the result is:
Array("",
"{{TOKEN1}}",
" is a ",
"{{TOKEN2}}",
" and it has some ",
"{{TOKEN3}}",
"");
The first issue was that I could not recover the tokens from the sentence. I solved this by adding parentheses to the regex, although I'm not sure why that works.
The issue I'm currently facing is the extra empty string at the beginning and/or end of the array when the first and/or last part of the template is a token. Why is this happening? Am I doing something wrong, or should I always check these two positions for emptiness?
In my code, I need to know which parts came from a token and which were fixed text in the template. With this solution, I have to check every array element for a string starting with "{{" and ending with "}}", which I don't think is the best approach. So, if someone comes up with a better way to break these things apart, I'll be glad to know!
Thank you!
Edit: as requested, here is a simple example of why I need the distinction between tokens and text.
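public abstract class TextParts
{
    public string Text { get; }
    protected TextParts(string text) { Text = text; }
}
public class TextToken : TextParts { public TextToken(string text) : base(text) { } }
public class TextConstant : TextParts { public TextConstant(string text) : base(text) { } }

var list = new List<TextParts>();
list.Add(new TextToken("{{TOKEN1}}"));
list.Add(new TextConstant(" is a "));
list.Add(new TextToken("{{TOKEN2}}"));
/* and so on */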
This way, I'll have a list of the parts that compose my string, and I'll be able to record them in my database to allow future manipulation and substitution. In fact, each of these tokens will be replaced by a regex string.
The objective is that users will be able to input messages like "{{SERVER}} is not listening on port {{PORT}}", and I'll be able to replace "{{SERVER}}" with [a-zA-Z0-9 ]+ and "{{PORT}}" with \d{1,5}. Does that make sense?
I hope this makes the post clearer.
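The substitution described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `PatternBuilder`, the `TokenPatterns` map, and the `^`/`$` anchoring are hypothetical, and the tokenizing pattern assumes tokens never contain braces. Constant text is passed through `Regex.Escape` so that user-typed characters such as `.` are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Hypothetical token-to-pattern map, using the two examples from the post.
    private static readonly Dictionary<string, string> TokenPatterns =
        new Dictionary<string, string>
        {
            { "{{SERVER}}", "[a-zA-Z0-9 ]+" },
            { "{{PORT}}", @"\d{1,5}" },
        };

    // Builds one anchored regex from a template: known tokens become
    // capturing groups, everything else is escaped literal text.
    public static string Build(string template)
    {
        var sb = new StringBuilder("^");
        foreach (Match m in Regex.Matches(template, @"\{\{[^{}]*\}\}|[^{}]+"))
        {
            string sub;
            if (TokenPatterns.TryGetValue(m.Value, out sub))
                sb.Append('(').Append(sub).Append(')');
            else
                sb.Append(Regex.Escape(m.Value)); // unknown tokens are treated as literal text
        }
        return sb.Append('$').ToString();
    }

    public static void Main()
    {
        string pattern = Build("{{SERVER}} is not listening on port {{PORT}}");
        Match m = Regex.Match("web01 is not listening on port 8080", pattern);
        Console.WriteLine(m.Success);         // True
        Console.WriteLine(m.Groups[1].Value); // web01
        Console.WriteLine(m.Groups[2].Value); // 8080
    }
}
```

Whether unknown tokens should be escaped, rejected, or given a default pattern is a design decision this sketch does not settle.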
If you split a string along delimiters, and the string starts or ends with a delimiter, that means there is an empty element before/after the first/last delimiter:
Imagine the following line in a CSV file:
,a,b,c,
That CSV row contains the elements "", "a", "b", "c", and "".
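The CSV case can be checked directly with `string.Split`. The following is a small sketch; the class and method names are made up for the illustration. Note that `StringSplitOptions.RemoveEmptyEntries` exists for `string.Split`, while `Regex.Split` has no equivalent option, so with a regex the empty strings would have to be filtered out afterwards.

```csharp
using System;

public static class EdgeDelimiterDemo
{
    // string.Split keeps empty entries by default, including the ones
    // before a leading delimiter and after a trailing delimiter.
    public static string[] SplitCsv(string row) => row.Split(',');

    // Same split, but empty entries are dropped.
    public static string[] SplitCsvNoEmpties(string row) =>
        row.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);

    public static void Main()
    {
        Console.WriteLine(SplitCsv(",a,b,c,").Length);          // 5
        Console.WriteLine(SplitCsvNoEmpties(",a,b,c,").Length); // 3
    }
}
```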
The same thing happens with your {{TOKEN}}. You could use a different method:
// subjectString is the template to break apart, e.g. template1 from the question
Regex regexObj = new Regex(@"\{\{[^{}]*\}\}|[^{}]+");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
If single braces may occur within or between tokens, you can also use
Regex regexObj = new Regex(@"\{\{(?:(?!\}\}).)*\}\}|(?:(?!\{\{).)+");
which will be a bit less efficient, though, because of all the lookahead assertions, so you should use this only if you need to.
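Since the question also asks how to tell tokens from fixed text without inspecting each string for "{{" and "}}", one possible refinement of the first pattern is to put each alternative in a named group and check which group matched. This is a sketch, not part of the original answer: the group names `token` and `text`, the class `TemplateParser`, and the tuple shape are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateParser
{
    // "token" matches {{...}} with no braces inside;
    // "text" matches any run of characters that are not braces.
    private static readonly Regex PartRegex =
        new Regex(@"(?<token>\{\{[^{}]*\}\})|(?<text>[^{}]+)");

    // Returns the parts in order, each flagged as token or constant text.
    public static List<(string Text, bool IsToken)> Parse(string template)
    {
        var parts = new List<(string Text, bool IsToken)>();
        foreach (Match m in PartRegex.Matches(template))
        {
            // Exactly one of the two named groups succeeds per match.
            parts.Add((m.Value, m.Groups["token"].Success));
        }
        return parts;
    }

    public static void Main()
    {
        var parts = Parse("{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}");
        foreach (var part in parts)
            Console.WriteLine((part.IsToken ? "token: " : "text:  ") + part.Text);
        Console.WriteLine(parts.Count); // 5
    }
}
```

Because this uses `Matches` rather than `Split`, no empty strings appear at the start or end. As with the original pattern, a stray single brace matches neither alternative and is silently skipped.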
Edit: I just noticed that there was another question in your post: Why did you need to add parentheses around your regex to make it "work"? Answer: Usually, a split() command only returns the contents between the delimiters. If you enclose the delimiters (or parts thereof) in capturing parentheses, then whatever is matched within those parentheses will also be added to the resulting list.
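The effect of the capturing parentheses can be seen by running the same split with and without them. This is a minimal sketch; the class and method names are invented for the demonstration, and the pattern is the brace-free token pattern used earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureDemo
{
    // No capturing group: the delimiters (tokens) are discarded.
    public static string[] WithoutGroup(string s) =>
        Regex.Split(s, @"\{\{[^{}]*\}\}");

    // Capturing group: the captured delimiter text is included in the result.
    public static string[] WithGroup(string s) =>
        Regex.Split(s, @"(\{\{[^{}]*\}\})");

    public static void Main()
    {
        Console.WriteLine(string.Join("|", WithoutGroup("a {{X}} b"))); // a | b
        Console.WriteLine(string.Join("|", WithGroup("a {{X}} b")));    // a |{{X}}| b
        // A token at the very start still produces a leading empty string.
        Console.WriteLine(WithGroup("{{X}} b").Length);                  // 3
    }
}
```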