C# - Splitting on a pipe with an escaped pipe in the data?

I've got a pipe delimited file that I would like to split (I'm using C#). For example:

This|is|a|test

However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:

This|is|a|pip\|ed|test (this is a pip|ed test)

I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.
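For reference, the replace/split/replace workaround described above might look roughly like the sketch below. The class name and the placeholder character are made up for illustration, and the approach only works if the placeholder can never appear in the real data.

```csharp
using System;
using System.Linq;

public static class PlaceholderSplitter
{
    // Hypothetical placeholder; assumed never to occur in the input data.
    private const string Placeholder = "\u0001";

    public static string[] Split(string line)
    {
        return line
            .Replace(@"\|", Placeholder)                        // hide escaped pipes
            .Split('|')                                         // split on the remaining pipes
            .Select(field => field.Replace(Placeholder, "|"))   // restore the hidden pipes
            .ToArray();
    }
}
```

Calling `PlaceholderSplitter.Split(@"This|is|a|pip\|ed|test")` gives five fields, the fourth being `pip|ed`.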

Frijoles asked Apr 28 '11 04:04


2 Answers

Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.

I do a lot of parsing like this, and it is really pretty straightforward. Done correctly, this approach will also tend to run faster.
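As a rough illustration of the String.IndexOfAny() variant, here is a minimal sketch. The class and method names are hypothetical, and it makes one assumption the question does not state: a backslash escapes whatever character follows it, so the backslash is dropped and the next character is kept literally.

```csharp
using System.Collections.Generic;
using System.Text;

public static class EscapedPipeParser
{
    private static readonly char[] Special = { '|', '\\' };

    public static List<string> Parse(string s)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        int pos = 0;

        while (true)
        {
            // Jump straight to the next pipe or backslash.
            int next = s.IndexOfAny(Special, pos);
            if (next < 0)
            {
                // No more special characters: the rest is the last field.
                current.Append(s, pos, s.Length - pos);
                fields.Add(current.ToString());
                return fields;
            }

            // Copy the plain text that precedes the special character.
            current.Append(s, pos, next - pos);

            if (s[next] == '\\' && next + 1 < s.Length)
            {
                // Assumed escape rule: keep the next character, drop the backslash.
                current.Append(s[next + 1]);
                pos = next + 2;
            }
            else if (s[next] == '\\')
            {
                // Lone backslash at the very end: keep it unchanged.
                current.Append('\\');
                pos = next + 1;
            }
            else
            {
                // Unescaped pipe: the current field is complete.
                fields.Add(current.ToString());
                current.Clear();
                pos = next + 1;
            }
        }
    }
}
```

Unlike a plain split, this version also removes the escape backslash, so `pip\|ed` comes back as `pip|ed`. If backslashes can appear in the data for other reasons, the escape rule would need to be narrowed.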

EDIT

In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.

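// Requires: using System.Collections.Generic;
// Splits s on pipes that are not preceded by a backslash. The escape
// backslash is left in the word (e.g. "pip\|ed"), and an empty field
// after a final trailing pipe is not returned.
public List<string> ParseWords(string s)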
{
    List<string> words = new List<string>();

    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;

        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }

        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;

        // Extract this word
        words.Add(s.Substring(start, pos - start));

        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}

Jonathan Wood answered Oct 04 '22 02:10


This oughta do it:

string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");

The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this, though; I just borrowed the regular expression from this post and simplified it.
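Note that Regex.Split() leaves the escape backslash in the data, so the escaped field still reads `pip\|ed`. Below is a minimal sketch that splits and then unescapes. It uses a plain single negative lookbehind, `(?<!\\)\|`, which is my own shorter pattern rather than the one above, and the class name is hypothetical. It assumes a backslash is only ever used to escape a pipe; a field that ends in a literal backslash just before a separator would be handled wrongly.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexPipeSplitter
{
    // Matches a pipe that is not immediately preceded by a backslash.
    private static readonly Regex UnescapedPipe = new Regex(@"(?<!\\)\|");

    public static string[] Split(string line)
    {
        return UnescapedPipe.Split(line)
            .Select(part => part.Replace(@"\|", "|")) // drop the escape backslash
            .ToArray();
    }
}
```

For example, `RegexPipeSplitter.Split(@"This|is|a|pip\|ed|test")` gives five fields, with `pip|ed` as the fourth.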

EDIT

In terms of performance, compared to the manual parsing method provided in this thread, I found that this Regex implementation is 3 to 5 times slower than Jonathan Wood's implementation, using the longer test string provided by the OP.

With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test; instead, I created a new Regex instance with the expression above outside of my test loop and then called its Split method.

UPDATE

Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the use of regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms).

The results were the same using the extended regular expression from the post I linked to.
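For anyone who wants to repeat the comparison, a timing harness along these lines should do. This is only a sketch and not the code used for the numbers above: the class and method names are hypothetical, it uses Stopwatch with a single warm-up call, and results will vary with the machine and .NET version.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class SplitBenchmark
{
    private const string Pattern = @"(?<!(?<!\\)*\\)\|";

    // Returns the average time per call in milliseconds.
    public static double Measure(Func<string, string[]> split, string input, int iterations)
    {
        split(input); // warm-up call so one-time setup costs are not timed

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            split(input);
        }
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Run()
    {
        const string input = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
        const int iterations = 1000000;

        // Instance created once, outside the timed loop.
        var instance = new Regex(Pattern);

        double instanceMs = Measure(s => instance.Split(s), input, iterations);
        double staticMs = Measure(s => Regex.Split(s, Pattern), input, iterations);

        Console.WriteLine($"instance Split: {instanceMs:F5} ms per call");
        Console.WriteLine($"static Split:   {staticMs:F5} ms per call");
    }
}
```

Passing the split as a delegate means the manual parser can be timed with the same `Measure` call, for example by wrapping it in a lambda that returns an array.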

Cᴏʀʏ answered Oct 04 '22 04:10