Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to parse Space Separated Text

I have string like this

 /c SomeText\MoreText "Some Text\More Text\Lol" SomeText

I want to tokenize it, however I can't just split on the spaces. I've come up with somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.

This is in C# btw.

EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.

private string[] tokenize(string input)
{
    string[] tokens = input.Split(' ');
    List<String> output = new List<String>();

    for (int i = 0; i < tokens.Length; i++)
    {
        if (tokens[i].StartsWith("\""))
        {
            string temp = tokens[i];
            int k = 0;
            for (k = i + 1; k < tokens.Length; k++)
            {
                if (tokens[k].EndsWith("\""))
                {
                    temp += " " + tokens[k];
                    break;
                }
                else
                {
                    temp += " " + tokens[k];
                }
            }
            output.Add(temp);
            i = k + 1;
        }
        else
        {
            output.Add(tokens[i]);
        }
    }

    return output.ToArray();            
}
like image 836
FlySwat Avatar asked Sep 10 '08 18:09

FlySwat


1 Answers

The computer term for what you're doing is lexical analysis; read that for a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.

The simplest way to do this is to define a word as a regular expression:

([^"^\s]+)\s*|"([^"]+)"\s*

This expression states that a "word" is either (1) non-quote, non-whitespace text surrounded by whitespace, or (2) non-quote text surrounded by quotes (followed by some whitespace). Note the use of capturing parentheses to highlight the desired text.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.

Dim token As String
Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")

While m.Success
    token = m.Groups(1).ToString
    If token.length = 0 And m.Groups.Count > 1 Then
        token = m.Groups(2).ToString
    End If
    m = m.NextMatch
End While

Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scene a little better :)

like image 124
Todd Myhre Avatar answered Sep 24 '22 08:09

Todd Myhre