
Efficiently split a string in format "{ {}, {}, ...}"

I have a string in the following format.

string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";

private void parsestring(string input)
{
    string[] tokens = input.Split(','); // I thought this would split on the , separating the {}
    foreach (string item in tokens)     // but that doesn't seem to be what it is doing
    {
       Console.WriteLine(item); 
    }
}

My desired output should be something like this below:

112,This is the first day 23/12/2009
132,This is the second day 24/12/2009

But currently, I get the one below:

{112
This is the first day 23/12/2009
{132
This is the second day 24/12/2009

I am very new to C# and any help would be appreciated.

Analia asked Nov 30 '22 21:11


2 Answers

Don't fixate on Split() being the solution! This is simple to parse without it. The regex answers are probably fine too, but in terms of raw efficiency, writing a small hand-rolled parser should do the trick.

// Requires: using System.Collections.Generic;
IEnumerable<string> Parse(string input)
{
    var results = new List<string>();
    int startIndex = 0;   // index of the first character after the most recent '{'
    int currentIndex = 0;

    while (currentIndex < input.Length)
    {
        var currentChar = input[currentIndex];
        if (currentChar == '{')
        {
            // The result starts right after the opening brace.
            startIndex = currentIndex + 1;
        }
        else if (currentChar == '}')
        {
            // The result ends just before the closing brace.
            int endIndex = currentIndex - 1;
            int length = endIndex - startIndex + 1;
            results.Add(input.Substring(startIndex, length));
        }

        currentIndex++;
    }

    return results;
}

It's not short on lines, but it iterates over the input once and performs only one string allocation per "result" (plus the list itself). With a little tweaking I could probably make a C# 8 version using the Index/Range types or spans that cuts down on allocations, but this is probably good enough.
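As a rough illustration of that idea (not something the answer spells out), the scan can report where each result lives instead of copying it. The sketch below assumes C# 8 / .NET Core 3.0 or later for System.Range; the class name BraceRanges and the method name Find are made up for this example.

```csharp
using System;
using System.Collections.Generic;

static class BraceRanges
{
    // Hypothetical helper: yields the position of each {...} body
    // instead of allocating a substring for it.
    public static IEnumerable<Range> Find(string input)
    {
        int start = -1; // -1 means "not currently inside a {...} group"
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '{')
            {
                start = i + 1;
            }
            else if (input[i] == '}' && start >= 0)
            {
                yield return start..i; // end of the range is exclusive
                start = -1;
            }
        }
    }
}
```

A caller can then decide whether to pay for a string at all: input[range] creates a substring on demand, while input.AsSpan()[range] gives a ReadOnlySpan<char> view over the original string without copying. The iterator itself still allocates one enumerator object.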

You could spend a whole day figuring out how to understand the regex, but this is as simple as it comes:

  • Scan every character.
  • If you find {, note the next character is the start of a result.
  • If you find }, consider everything from the last noted "start" until the index before this character as "a result".

This won't catch mismatched brackets, and it silently returns bogus results (such as empty strings or text containing stray braces) for strings like "}}{". You didn't ask for handling those cases, but it's not too hard to improve this logic to catch them and either scream about it or recover.

For example, you could reset startIndex to something like -1 when } is found. From there, if you find { when startIndex >= 0, you've found "{{". If you find } when startIndex == -1, you've found "}}". And if you exit the loop with startIndex >= 0, that's an opening { with no closing }. That leaves the string "}whoops" as an uncovered case, but it could be handled by initializing startIndex to, say, -2 and checking for that specifically. Do that with a regex, and you'll have a headache.
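A minimal sketch of those checks, under a couple of assumptions: it throws FormatException rather than recovering, and it initializes startIndex to -1 instead of -2, so a leading } (as in "}whoops") is reported by the same check as "}}". The class name StrictBraceParser is invented for illustration.

```csharp
using System;
using System.Collections.Generic;

static class StrictBraceParser
{
    public static List<string> Parse(string input)
    {
        var results = new List<string>();
        int startIndex = -1; // -1 means "not inside a {...} group"

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '{')
            {
                // A second '{' before the previous group was closed.
                if (startIndex != -1)
                    throw new FormatException($"Unexpected '{{' at index {i}.");
                startIndex = i + 1;
            }
            else if (c == '}')
            {
                // A '}' with no group open: covers "}}" and a leading '}'.
                if (startIndex == -1)
                    throw new FormatException($"Unmatched '}}' at index {i}.");
                results.Add(input.Substring(startIndex, i - startIndex));
                startIndex = -1;
            }
        }

        // Still inside a group at the end of the input.
        if (startIndex != -1)
            throw new FormatException("Missing closing '}'.");

        return results;
    }
}
```

On the string from the question this returns the same two results as the lenient version. Text outside the braces (such as the comma between groups) is still ignored rather than validated.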

The main reason I suggest this is you said "efficiently". icepickle's solution is nice, but Split() makes one allocation per token, then you perform allocations for each TrimX() call. That's not "efficient". That's "n + 2 allocations".

OwenP answered Dec 04 '22 14:12


Use Regex for this:

string[] tokens = Regex.Split(input, @"}\s*,\s*{")
  .Select(i => i.Replace("{", "").Replace("}", ""))
  .ToArray();

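Pattern explanation:

} - match the closing brace of one group literally

\s* - match zero or more white space characters

, - match the comma separating two groups

{ - match the opening brace of the next group literally

The split leaves the very first { and the very last } in place, which is why the Replace calls strip any remaining braces. The snippet needs using System.Linq and using System.Text.RegularExpressions.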
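For reference, here is the same approach wrapped into a self-contained sketch with its using directives. The class and method names (RegexSplitDemo, SplitGroups) are invented for this example, and the braces in the pattern are escaped for readability; that should behave the same as the unescaped pattern above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexSplitDemo
{
    public static string[] SplitGroups(string input)
    {
        // Split on "}, {" (with optional whitespace around the comma),
        // then strip the braces left on the first and last pieces.
        return Regex.Split(input, @"\}\s*,\s*\{")
            .Select(part => part.Replace("{", "").Replace("}", ""))
            .ToArray();
    }
}
```

Called with the string from the question, SplitGroups returns two elements: "112,This is the first day 23/12/2009" and "132,This is the second day 24/12/2009". Note that the Replace calls would also remove any braces appearing inside the text itself.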

Michał Turczyn answered Dec 04 '22 16:12