I thought this would be trivial, but I can't get it to work.
Assume a line in a CSV file:
"Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"
string[] tokens = line.Split(',');
I expect this:
"Barack Obama"
48
"President"
"1600 Penn Ave, Washington DC"
but the last token is 'Washington DC', not "1600 Penn Ave, Washington DC".
Is there an easy way to get the split function to ignore the comma within quotes?
I have no control over the CSV file and it doesn't get sent to me. Customer A will be using the app to read files provided by an external individual.
I have a SplitWithQualifier extension method that I use here and there, which utilizes Regex.
I make no claim as to the robustness of this code, but it has worked all right for me for a while.
// mangled code horribly to fit without scrolling
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvSplitter
{
    public static string[] SplitWithQualifier(this string text,
                                              char delimiter,
                                              char qualifier,
                                              bool stripQualifierFromResult)
    {
        // Match the delimiter only when an even number of qualifiers follows it,
        // i.e. when the delimiter is not inside a qualified (quoted) field.
        string pattern = string.Format(
            @"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter.ToString()),
            Regex.Escape(qualifier.ToString())
        );

        string[] split = Regex.Split(text, pattern);

        if (stripQualifierFromResult)
            return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
        else
            return split;
    }
}
Usage:
string csv = "\"Barack Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);
foreach (string value in values)
    Console.WriteLine(value);
Output:
Barack Obama
48
President
1600 Penn Ave, Washington DC
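To see why the pattern works, it can help to write it out for the concrete case of a comma delimiter and a double-quote qualifier. The sketch below is only an illustration (the class name LookaheadDemo and the sample line are made up for this purpose): the lookahead accepts a comma only when an even number of quotes follows it, which is the case exactly when the comma sits outside a quoted field, assuming the quotes in the line are balanced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // The pattern from SplitWithQualifier with ',' and '"' substituted in.
    // (?:[^"]*"[^"]*")*  consumes quotes two at a time,
    // (?![^"]*")         then requires that no further quote remains.
    // Together they succeed only if an even number of quotes follows the comma.
    public const string Pattern = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";

    public static string[] Split(string line)
    {
        return Regex.Split(line, Pattern);
    }

    public static void Main()
    {
        string line = "\"a, b\", 1, \"c\"";
        foreach (string token in Split(line))
            Console.WriteLine("[" + token.Trim() + "]");
        // Prints:
        // ["a, b"]
        // [1]
        // ["c"]
    }
}
```

The comma inside "a, b" is followed by three quotes, so the lookahead fails and no split happens there. The other two commas are each followed by an even number of quotes and are treated as delimiters.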
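You might have to write your own split function:
Iterate through each character in the string.
When you hit a " character, toggle a boolean.
When you hit the separator, keep it as part of the current token if the boolean is set; otherwise the current token is complete.
Here's an example: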
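using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    public static string[] SplitQuoted(this string input, char separator, char quotechar)
    {
        List<string> tokens = new List<string>();
        StringBuilder sb = new StringBuilder();
        bool inQuotes = false; // true while we are between a pair of quote characters

        foreach (char c in input)
        {
            if (c.Equals(separator) && !inQuotes)
            {
                // separator outside quotes: we have a token
                tokens.Add(sb.ToString().Trim());
                sb.Clear();
            }
            else if (c.Equals(separator) && inQuotes)
            {
                // separator inside quotes: not a delimiter, keep it in the token
                sb.Append(c);
            }
            else if (c.Equals(quotechar))
            {
                // toggle the flag; the quote character itself stays in the token
                inQuotes = !inQuotes;
                sb.Append(c);
            }
            else
            {
                sb.Append(c);
            }
        }

        // add the final token (the line has no trailing separator)
        tokens.Add(sb.ToString().Trim());
        return tokens.ToArray();
    }
}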
Then just call:
string[] tokens = line.SplitQuoted(',','\"');
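Note that SplitQuoted keeps the quote characters in each token. If the quotes should be dropped, and if the files escape a quote inside a quoted field by doubling it (a common CSV convention, but an assumption here since the question does not say), a variant along the same lines might look like the sketch below. The names CsvLineSketch and SplitCsvLine are hypothetical, and this has not been tested against real customer files.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineSketch
{
    // Splits one CSV line, dropping the surrounding quotes and turning a
    // doubled quote ("") inside a quoted field into a single literal quote.
    public static List<string> SplitCsvLine(string line, char separator = ',', char quote = '"')
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == quote)
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == quote)
                {
                    current.Append(quote); // doubled quote: keep one, skip the other
                    i++;
                }
                else
                {
                    inQuotes = !inQuotes;  // opening or closing quote: not kept
                }
            }
            else if (c == separator && !inQuotes)
            {
                tokens.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // Trim() also removes leading/trailing spaces that were inside quotes.
        tokens.Add(current.ToString().Trim());
        return tokens;
    }
}
```

For the line in the question, this should give the four values Barack Obama, 48, President and 1600 Penn Ave, Washington DC, without quotes. It still works on a single line only, so quoted fields that contain line breaks are not handled.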
Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them.
Code:
string input = "\"Barack Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;

// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
Output:
1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted = 2406.25ms
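DateTime.Now can have fairly coarse resolution, so for anyone repeating the measurement, System.Diagnostics.Stopwatch is usually the better fit. The sketch below is one possible harness, not the code that produced the numbers above: BenchmarkSketch and TimeMs are made-up names, and the built-in string.Split is used only as a stand-in workload so that the block compiles on its own.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Runs the action once as a warm-up (so JIT compilation is not timed),
    // then returns the elapsed milliseconds for the given number of iterations.
    public static double TimeMs(int iterations, Action action)
    {
        action();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        string input = "\"Barack Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
        string[] tokens = null;

        // Stand-in workload; swap in SplitWithQualifier or SplitQuoted to compare them.
        double ms = TimeMs(1000000, () => tokens = input.Split(','));
        Console.WriteLine("1,000,000 x string.Split = {0}ms", ms);
    }
}
```

Absolute timings will differ by machine and runtime, so only the ratio between the two methods measured on the same setup is meaningful.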