Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split using delimiter except when delimiter is escaped

I'm reading clipboard data coming from excel using

var stream = (System.IO.Stream) ( Forms.Clipboard.GetDataObject() ).GetData( Forms.DataFormats.CommaSeparatedValue );,

but unfortunately, excel is passing cell text instead of cell values. When the cells are using special formatting (such as the thousands seperator), the clipboard data for a series of cells in columns that looks like this:

 1,234,123.00    2,345.00    342.00      12,345.00

is stored as this:

\" 1,234,123.00 \",\" 2,345.00 \", 342.00 ,\" 12,345.00 \"

when what I really want is this:

 1234123.00, 2345.00, 342.00, 12345.00

I had been previously using the clipData.Split(new string[] { "," }, StringSllitOptions.None)) function to turn my CSV clipboard data into a series of cells, but this fails when there is escaped formatted text containing commas.


I'm asking if anyone can think of a way to split this string into a set of cells, ignoring the commas escaped within the \" bits, since this is how Excel is choosing to escape cells containing commas.

In short, how can I turn a single string containing this:

\" 1,234,123.00 \",\" 2,345.00 \", 342.00 ,\" 12,345.00 \"

into an array of strings containing this:

{ "1,234,123.00", "2,345.00", "342.00", "12,345.00" }

Without ruining my ability to parse a simple comma delimited string.

*****edit***

Follow up question (formulated as a DFA) here: Split a string based on each time a Deterministic Finite Automata reaches a final state?

like image 938
Alain Avatar asked Dec 09 '10 21:12

Alain


People also ask

What is escape delimiter?

An escape delimiter is simply a sequence that is recognized and ignored during parsing. Its purpose is to allow the use of escape sequences to embed byte sequences in data that would otherwise be seen as delimiter occurrences.

When splitting a string using a given separator it returns?

The Split method extracts the substrings in this string that are delimited by one or more of the strings in the separator parameter, and returns those substrings as elements of an array.


2 Answers

First off I've dealt with data from Excel before and what you typically see is comma separated values and if the value is considered to be a string it will have double quotes around it (and can contain commas and double quotes). If it is considered to be numeric then there are not double quotes. Additionally if the data contains a double quote that will be delimited by a double quote like "". So assuming all of that here's how I've dealt with this in the past

public static IEnumerable<string> SplitExcelRow(this string value)
{
    value = value.Replace("\"\"", "&quot;");
    bool quoted = false;
    int currStartIndex = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char currChar = value[i];
        if (currChar == '"')
        {
            quoted = !quoted;       
        }
        else if (currChar == ',')
        {
            if (!quoted)
            {
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"","")
                    .Replace("&quot;","\"");
                currStartIndex = i + 1;
            }
        }
    }
    yield return value.Substring(currStartIndex, value.Length - currStartIndex)
        .Trim()
        .Replace("\"", "")
        .Replace("&quot;", "\"");
}

Of course this assumes the data coming in is valid so if you have something like "fo,o"b,ar","bar""foo" this will not work. Additionally if your data contains &quot; then it will be turned into a " which may or may not be desirable.

like image 80
juharr Avatar answered Sep 21 '22 12:09

juharr


There are a lot of ways to do this. One inelegant way that would work is:

  1. Convert \",\" to tab or some other delimiter (I assume you left out a few \" in your example because otherwise the string is not consistent
  2. Strip all remaining commas
  3. Strip all remaining \"
  4. Convert your delimiter (e.g. tab) back into a comma

Now you have what you wanted in first place

like image 44
Kyle Banerjee Avatar answered Sep 18 '22 12:09

Kyle Banerjee