Split tokens in a string using Regex in C#

Tags: c#, regex, split, tokenize

I have some "tokenized" templates, for example (I call tokens the part between double braces):

var template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";

I want to extract an array from this sentence, in order to have something like:

Array("{{TOKEN1}}",
      " is a ",
      "{{TOKEN2}}", 
      " and it has some ", 
      "{{TOKEN3}}");

I've tried to achieve that with the following Regex code:

Regex r = new Regex(@"({{[^\}]*}})");
var n = r.Split(template1);

And the result is:

Array("",
      "{{TOKEN1}}",
      " is a ",
      "{{TOKEN2}}", 
      " and it has some ", 
      "{{TOKEN3}}",
      "");

The first issue was that I was not able to recover the tokens from the sentence. I solved this just by adding the parentheses to the regex, even though I'm not sure why this solves it.

The issue I'm currently facing is the extra empty string at the beginning and/or end of the array when the first and/or last part of the template is a token. Why is this happening? Am I doing something wrong, or should I always check these two positions for emptiness?

In my code, I need to know which parts came from a token and which were fixed text in the template. With this solution, I have to check every array element for a string starting with "{{" and ending with "}}", which I don't think is the best approach. So, if someone comes up with a better way to break these things apart, I'll be glad to know!

Thank you!

Edit: as requested, here is a simple example of why I need this distinction between tokens and text.

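public abstract class TextParts
{
    // The raw text of this part, e.g. "{{TOKEN1}}" or " is a "
    public string Text { get; }
    protected TextParts(string text) { Text = text; }
}
public class TextToken : TextParts { public TextToken(string text) : base(text) { } }
public class TextConstant : TextParts { public TextConstant(string text) : base(text) { } }

var list = new List<TextParts>();
list.Add( new TextToken("{{TOKEN1}}") );
list.Add( new TextConstant(" is a ") );
list.Add( new TextToken("{{TOKEN2}}") );
/* and so on */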

This way, I'll have a list of the parts that compose my string, and I'll be able to store it in my database to allow future manipulation and substitution. In fact, each of these tokens will be replaced by a regex string.

The objective is that users will be able to input messages like "{{SERVER}} is not listening on port {{PORT}}", and I'll be able to replace "{{SERVER}}" with [a-zA-Z0-9 ]+ and "{{PORT}}" with \d{1,5}. Does that make sense?

I hope this makes the post clearer.

676

asked Oct 13 '12 17:10

tyron

1 Answer

If you split a string on delimiters and the string starts or ends with a delimiter, the result contains an empty element before the first delimiter or after the last one.

Imagine the following line in a CSV file:

,a,b,c,

That CSV row contains the elements "", "a", "b", "c", and "".
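As a quick check of the CSV case, here is a minimal sketch using plain string.Split. The variable names are only illustrative, and it is written with top-level statements, which assumes C# 9 or later; in older versions the same lines would go inside a Main method.

```csharp
using System;

// A leading and a trailing comma each produce an empty element.
string[] fields = ",a,b,c,".Split(',');
Console.WriteLine(fields.Length);            // 5
Console.WriteLine(string.Join("|", fields)); // |a|b|c|

// If the empty entries are not wanted, they can be dropped while splitting.
string[] nonEmpty = ",a,b,c,".Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(nonEmpty.Length);          // 3
```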

The same thing happens with your {{TOKEN}} delimiters. Instead of splitting, you could match the tokens and the text between them directly:

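// Each match is either a complete {{TOKEN}} or a run of text without braces
Regex regexObj = new Regex(@"\{\{[^{}]*\}\}|[^{}]+");
// subjectString is the template to break apart, e.g. template1
MatchCollection allMatchResults = regexObj.Matches(subjectString);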

If single braces may occur within or between tokens, you can also use

Regex regexObj = new Regex(@"\{\{(?:(?!\}\}).)*\}\}|(?:(?!\{\{).)+");

which will be a bit less efficient, though, because of all the lookahead assertions, so you should use this only if you need to.
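Since you need to know which parts are tokens, one possible way to build on the first pattern is to wrap the token alternative in a named group and test whether that group matched. This is only a sketch: the group name "token" and the tuple list are my own additions, not something from your code, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// First pattern from above, with a named group (added for this sketch)
// around the token alternative so each match can be classified.
var regex = new Regex(@"(?<token>\{\{[^{}]*\}\})|[^{}]+");
string template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";

var parts = new List<(string Text, bool IsToken)>();
foreach (Match m in regex.Matches(template1))
{
    // The "token" group succeeds only when the first alternative matched.
    parts.Add((m.Value, m.Groups["token"].Success));
}

foreach (var part in parts)
{
    string kind = part.IsToken ? "token" : "text";
    Console.WriteLine(kind + ": \"" + part.Text + "\"");
}
```

With this approach there are no empty elements to filter, and the token check no longer depends on inspecting the start and end of each string. Keep in mind that Matches silently skips characters the pattern cannot match, such as a stray single brace.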

Edit: I just noticed that there was another question in your post: Why did you need to add parentheses around your regex to make it "work"? Answer: Usually, a split() command only returns the contents between the delimiters. If you enclose the delimiters (or parts thereof) in capturing parentheses, then whatever is matched within those parentheses will also be added to the resulting list.
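To see the effect of the capturing parentheses, here is a small sketch that runs Regex.Split on your template with and without a group. The braces are escaped here for clarity, and the final filtering step is just one possible way to drop the empty strings if you prefer to keep using Split; the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";

// Without a capturing group, only the text between the delimiters is returned.
string[] withoutGroup = Regex.Split(template1, @"\{\{[^}]*\}\}");

// With a capturing group, the captured delimiter text is included as well.
string[] withGroup = Regex.Split(template1, @"(\{\{[^}]*\}\})");

// Dropping empty strings removes the leading and trailing empty elements.
string[] cleaned = withGroup.Where(s => s.Length > 0).ToArray();

Console.WriteLine(withoutGroup.Length); // 4
Console.WriteLine(withGroup.Length);    // 7
Console.WriteLine(cleaned.Length);      // 5
```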

185

answered Sep 18 '22 16:09

Tim Pietzcker
