Split tokens in a string using Regex in C#

I have some "tokenized" templates, for example (I call tokens the part between double braces):

var template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";

I want to extract an array from this sentence, in order to have something like:

Array("{{TOKEN1}}",
      " is a ",
      "{{TOKEN2}}", 
      " and it has some ", 
      "{{TOKEN3}}");

I've tried to achieve that with the following Regex code:

Regex r = new Regex(@"({{[^\}]*}})");
var n = r.Split(template1);

And the result is:

Array("",
      "{{TOKEN1}}",
      " is a ",
      "{{TOKEN2}}", 
      " and it has some ", 
      "{{TOKEN3}}",
      "");

The first issue was that I was not able to recover the tokens from the sentence. I solved this just by adding the parentheses to the Regex expression, even though I'm not sure why that solves it.

The issue I'm currently facing is the extra empty term at the beginning and/or at the end of the array when the first and/or last terms of the template are "tokens". Why is this happening? Am I doing something wrong, or should I always check these two positions for emptiness?

In my code, I will need to know which term came from a token and which was fixed text in the template. With this solution, I will have to check every array position for a string starting with "{{" and ending with "}}", which I don't think is the best approach. So, if someone comes up with a better way to break these things apart, I'll be glad to know!

Thank you!

Edit: as requested, I'll post a simple example of why I need this distinction between tokens and text.

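// Constructors (and the Text property) are added here only so the example compiles.
public abstract class TextParts
{
    public string Text { get; private set; }
    protected TextParts(string text) { Text = text; }
}
public class TextToken : TextParts { public TextToken(string text) : base(text) { } }
public class TextConstant : TextParts { public TextConstant(string text) : base(text) { } }

var list = new List<TextParts>();
list.Add(new TextToken("{{TOKEN1}}"));
list.Add(new TextConstant(" is a "));
list.Add(new TextToken("{{TOKEN2}}"));
/* and so on */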

This way, I'll have a list of the parts that compose my string, and I'll be able to record it in my database to allow future manipulation and substitution. In fact, each of these tokens will be replaced by a Regex string.

The objective is that users will be able to input messages like "{{SERVER}} is not listening on port {{PORT}}", and I'll be able to replace "{{SERVER}}" with [a-zA-Z0-9 ]+ and "{{PORT}}" with \d{1,5}. Does that make sense?

I hope this makes the post clearer.

asked Oct 13 '12 17:10 by tyron


1 Answer

If you split a string along delimiters, and the string starts or ends with a delimiter, that means there is an empty element before/after the first/last delimiter:

Imagine the following line in a CSV file:

,a,b,c,

That CSV row contains the elements "", "a", "b", "c", and "".
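
This is easy to check. The following is a minimal sketch (the class and method names are only illustrative) that splits the CSV row above with string.Split, which keeps empty entries by default:

```csharp
using System;

public static class CsvSplitDemo
{
    // Splits a CSV-like row on commas; empty entries are kept by default.
    public static string[] SplitRow(string row)
    {
        return row.Split(',');
    }

    public static void Main()
    {
        string[] parts = SplitRow(",a,b,c,");

        // Print each element in quotes so the empty strings are visible.
        foreach (string part in parts)
        {
            Console.WriteLine("\"" + part + "\"");
        }

        // Five elements: "", "a", "b", "c", ""
        Console.WriteLine(parts.Length);
    }
}
```

The leading and trailing empty strings come from the same rule that produces them in the Regex.Split result in the question.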

The same thing happens with your {{TOKEN}}. You could use a different method:

MatchCollection allMatchResults = null;
Regex regexObj = new Regex(@"\{\{[^{}]*\}\}|[^{}]+");
allMatchResults = regexObj.Matches(subjectString);
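
Since you also need to know which parts are tokens, one possible variation is to wrap each alternative in a named group and check which one matched. This is only a sketch: the group names "token" and "text", the TemplateParser class and the Parse method are my own illustrative choices, not anything from your code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateParser
{
    // Same two alternatives as the regex above, each wrapped in a named group.
    // The group names "token" and "text" are illustrative assumptions.
    private static readonly Regex PartRegex =
        new Regex(@"(?<token>\{\{[^{}]*\}\})|(?<text>[^{}]+)");

    // Returns every part of the template together with a flag
    // that is true when the part is a {{TOKEN}}.
    public static List<KeyValuePair<string, bool>> Parse(string template)
    {
        var parts = new List<KeyValuePair<string, bool>>();
        foreach (Match m in PartRegex.Matches(template))
        {
            bool isToken = m.Groups["token"].Success;
            parts.Add(new KeyValuePair<string, bool>(m.Value, isToken));
        }
        return parts;
    }

    public static void Main()
    {
        string template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";
        foreach (var part in Parse(template1))
        {
            string kind = part.Value ? "token: " : "text:  ";
            Console.WriteLine(kind + "\"" + part.Key + "\"");
        }
    }
}
```

For template1 this yields five parts and no empty strings, and the flag could be used to decide between creating a TextToken and a TextConstant without inspecting the string for "{{" and "}}".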

If single braces may occur within or between tokens, you can also use

Regex regexObj = new Regex(@"\{\{(?:(?!\}\}).)*\}\}|(?:(?!\{\{).)+");

which will be a bit less efficient, though, because of all the lookahead assertions, so you should use this only if you need to.
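
To illustrate the difference, here is a small sketch that runs both patterns against an input containing single braces. The input string and the BraceDemo and MatchAll names are made up for the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BraceDemo
{
    public const string SimplePattern = @"\{\{[^{}]*\}\}|[^{}]+";
    public const string LookaheadPattern = @"\{\{(?:(?!\}\}).)*\}\}|(?:(?!\{\{).)+";

    // Returns the text of every match of the pattern in the input.
    public static string[] MatchAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Hypothetical input with single braces between the tokens.
        string input = "{{A}} costs {5} and {{B}}";

        // Simple pattern: the single braces match neither alternative and are skipped,
        // giving "{{A}}", " costs ", "5", " and ", "{{B}}".
        Console.WriteLine(string.Join("|", MatchAll(SimplePattern, input)));

        // Lookahead pattern: the text between tokens stays in one piece,
        // giving "{{A}}", " costs {5} and ", "{{B}}".
        Console.WriteLine(string.Join("|", MatchAll(LookaheadPattern, input)));
    }
}
```

With the simpler regex the single braces are silently dropped from the result, which is why the lookahead version is the one to use when such input is possible.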

Edit: I just noticed that there was another question in your post: Why did you need to add parentheses around your regex to make it "work"? Answer: Usually, a split() command only returns the contents between the delimiters. If you enclose the delimiters (or parts thereof) in capturing parentheses, then whatever is matched within those parentheses will also be added to the resulting list.
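
The effect of the parentheses can be seen by running Regex.Split both ways. This is a sketch using the template from the question; the SplitDemo class and its method names are illustrative, and the final method shows one possible way to drop the empty entries if you prefer to stay with Split.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    public const string Template =
        "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";

    // No capturing group: only the text between the delimiters is returned.
    public static string[] SplitWithoutGroup(string input)
    {
        return Regex.Split(input, @"\{\{[^}]*\}\}");
    }

    // Capturing group: the matched delimiters are included in the result as well.
    public static string[] SplitWithGroup(string input)
    {
        return Regex.Split(input, @"(\{\{[^}]*\}\})");
    }

    // Same as SplitWithGroup, with the empty leading/trailing entries filtered out.
    public static string[] SplitNonEmpty(string input)
    {
        return SplitWithGroup(input).Where(s => s.Length > 0).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(SplitWithoutGroup(Template).Length); // 4: "", " is a ", " and it has some ", ""
        Console.WriteLine(SplitWithGroup(Template).Length);    // 7: the same plus the three tokens
        Console.WriteLine(SplitNonEmpty(Template).Length);     // 5: the empty strings are removed
    }
}
```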

answered Sep 18 '22 16:09 by Tim Pietzcker