Split String in C#

Tags:

c#

I thought this would be trivial, but I can't get it to work.

Assume a line in a CSV file: "Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"

string[] tokens = line.Split(',');

I expect this:

 "Barack Obama"
 48
 "President"
 "1600 Penn Ave, Washington DC"

but the last token is 'Washington DC' not "1600 Penn Ave, Washington DC".

Is there an easy way to get the split function to ignore the comma within quotes?

I have no control over the CSV file, and it doesn't get sent to me. Customer A will be using the app to read files provided by an external individual.

ritu asked May 11 '10 01:05



2 Answers

I have a SplitWithQualifier extension method that I use here and there, which utilizes Regex.

I make no claim as to the robustness of this code, but it has worked all right for me for a while.

using System.Linq;
using System.Text.RegularExpressions;

// mangled code horribly to fit without scrolling
public static class CsvSplitter
{
    public static string[] SplitWithQualifier(this string text,
                                              char delimiter,
                                              char qualifier,
                                              bool stripQualifierFromResult)
    {
        string pattern = string.Format(
            @"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter.ToString()),
            Regex.Escape(qualifier.ToString())
        );

        string[] split = Regex.Split(text, pattern);

        if (stripQualifierFromResult)
            return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
        else
            return split;
    }
}

Usage:

string csv = "\"Barack Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);

foreach (string value in values)
    Console.WriteLine(value);

Output:

Barack Obama
48
President
1600 Penn Ave, Washington DC
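
To make the pattern above easier to follow, here is a minimal sketch with the delimiter fixed to a comma and the qualifier fixed to a double quote. The lookahead accepts a comma only when an even number of quote characters remains to its right, which means the comma sits outside a quoted field. The class name LookaheadDemo is made up for illustration and is not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // The SplitWithQualifier pattern with ',' and '"' substituted in.
    // (?:[^"]*"[^"]*")*  consumes quote characters two at a time;
    // (?![^"]*")         then requires that no quote character is left over.
    private const string Pattern = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";

    public static string[] Split(string line)
    {
        // Splits only on commas that pass the lookahead; tokens are not trimmed.
        return Regex.Split(line, Pattern);
    }
}
```

Because the lookahead rescans the remainder of the line for every candidate comma, this approach does more work per line than a single pass over the characters.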
Dan Tao answered Sep 19 '22 12:09


You might have to write your own split function.

  • Iterate through each char in the string
  • When you hit a " character, toggle a boolean that tracks whether you are inside quotes
  • When you hit a comma, if the boolean is true, treat the comma as part of the current token; otherwise, the current token is complete

Here's an example:

using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    public static string[] SplitQuoted(this string input, char separator, char quotechar)
    {
        List<string> tokens = new List<string>();

        StringBuilder sb = new StringBuilder();
        bool escaped = false; // true while we are inside a quoted field
        foreach (char c in input)
        {
            if (c.Equals(separator) && !escaped)
            {
                // we have a token
                tokens.Add(sb.ToString().Trim());
                sb.Clear();
            }
            else if (c.Equals(separator) && escaped)
            {
                // ignore but add to string
                sb.Append(c);
            }
            else if (c.Equals(quotechar))
            {
                escaped = !escaped;
                sb.Append(c);
            }
            else
            {
                sb.Append(c);
            }
        }
        tokens.Add(sb.ToString().Trim());

        return tokens.ToArray();
    }
}

Then just call:

string[] tokens = line.SplitQuoted(',','\"');
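
Note that SplitQuoted appends the quote characters to each token, so the results come back as "Barack Obama" rather than Barack Obama. If the quotes are not wanted, a small post-processing step can strip them. The following is only a sketch: TokenCleanup, Unquote and UnquoteAll are hypothetical names, not part of the answer's code.

```csharp
using System;
using System.Linq;

public static class TokenCleanup
{
    // Removes one leading and one trailing quote character, but only when both are present.
    public static string Unquote(string token, char quotechar)
    {
        string t = token.Trim();
        if (t.Length >= 2 && t[0] == quotechar && t[t.Length - 1] == quotechar)
            return t.Substring(1, t.Length - 2);
        return t;
    }

    // Applies Unquote to every token, e.g. to the array returned by SplitQuoted.
    public static string[] UnquoteAll(string[] tokens, char quotechar)
    {
        return tokens.Select(t => Unquote(t, quotechar)).ToArray();
    }
}
```

Usage would then look like TokenCleanup.UnquoteAll(line.SplitQuoted(',', '\"'), '\"'). Unquoted values such as 48 pass through unchanged.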

Benchmarks

Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them.

Code:

string input = "\"Barack Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;

// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted =        {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

Output:

1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted =        2406.25ms
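
The timings above come from DateTime.Now, whose resolution on Windows is typically around 15 ms (both figures are exact multiples of 15.625 ms). For anyone re-running the comparison, System.Diagnostics.Stopwatch is generally a better fit for measuring elapsed time. A minimal sketch, assuming a hypothetical helper named Bench.TimeMs that is not part of the original benchmark:

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs `action` once as a warm-up, then `iterations` more times under a Stopwatch,
    // and returns the elapsed time of the timed runs in milliseconds.
    public static double TimeMs(Action action, int iterations)
    {
        action(); // warm-up call, so first-call JIT compilation is not measured

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds;
    }
}
```

It could be called as Bench.TimeMs(() => input.SplitQuoted(',', '\"'), 1000000). The absolute numbers will differ from those above, since they depend on the machine and runtime.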
Damovisa answered Sep 21 '22 12:09