I have the following type of string
var string = "'string, duppi, du', 23, lala"
I want to split the string into an array on each comma, but only the commas outside the single quotation marks.
I can't figure out the right regular expression for the split...
string.split(/,/)
will give me
["'string", " duppi", " du'", " 23", " lala"]
but the result should be:
["string, duppi, du", "23", "lala"]
Is there a cross-browser solution?
Re: Handling 'comma' in the data while writing to a CSV. So for data fields that contain a comma, you should just be able to wrap them in a double quote. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
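For the writing direction described above, a minimal sketch might look like the following. The names quoteCsvField and toCsvLine are hypothetical helpers made up for illustration (they are not from this thread), and the doubling of embedded double quotes follows the RFC 4180 convention rather than the backslash-escaped format discussed further down.

```javascript
// Hypothetical helper (an assumption, not from the thread): quote one field for CSV output.
function quoteCsvField(value) {
  var text = String(value);
  // Quote only when the field contains a double quote, comma, CR, or LF.
  if (/[",\r\n]/.test(text)) {
    // RFC 4180 style: an embedded double quote is escaped by doubling it.
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Hypothetical helper: join already-quoted fields into one CSV line.
function toCsvLine(fields) {
  var out = [];
  for (var i = 0; i < fields.length; i++) {
    out.push(quoteCsvField(fields[i]));
  }
  return out.join(',');
}

var line = toCsvLine(['string, duppi, du', 23, 'say "hi"']);
// line is: "string, duppi, du",23,"say ""hi"""
```

Note that output produced this way uses doubled quotes, so it would not be read back correctly by a parser that expects backslash escapes.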
Use the String.split() method to convert a comma-separated string to an array, e.g. const arr = str.split(','). The split() method splits the string on every occurrence of a comma, including commas inside quotes, and returns an array containing the results.
2014-12-01 Update: The answer below works only for one very specific format of CSV. As correctly pointed out by DG in the comments, this solution does NOT fit the RFC 4180 definition of CSV and it also does NOT fit MS Excel format. This solution simply demonstrates how one can parse one (non-standard) CSV line of input which contains a mix of string types, where the strings may contain escaped quotes and commas.
As austincheney correctly points out, you really need to parse the string from start to finish if you wish to properly handle quoted strings that may contain escaped characters. Also, the OP does not clearly define what a "CSV string" really is. First we must define what constitutes a valid CSV string and its individual values.
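To illustrate what "parse the string from start to finish" can mean in practice, here is a rough character-by-character sketch. It is not the regex-based solution this answer goes on to present: splitQuoted is a hypothetical name, the function does no validation at all (an unbalanced quote or a stray apostrophe in an unquoted value silently produces odd output), and an empty input yields one empty value rather than an empty array.

```javascript
// Illustrative sketch only; splitQuoted is a hypothetical name, not the answer's function.
function splitQuoted(text) {
  var values = [];
  var current = '';
  var quote = null;       // Quote character we are currently inside, or null.
  var wasQuoted = false;  // True once the current value has had a quoted part.

  function finishValue() {
    // Unquoted values are trimmed; quoted values keep their inner whitespace.
    values.push(wasQuoted ? current : current.replace(/^\s+|\s+$/g, ''));
    current = '';
    wasQuoted = false;
  }

  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (quote !== null) {
      if (ch === '\\' && i + 1 < text.length) {
        // Consume the escaped pair; drop the backslash only before the active quote.
        var next = text.charAt(++i);
        current += (next === quote) ? quote : ch + next;
      } else if (ch === quote) {
        quote = null;     // Closing quote.
      } else {
        current += ch;
      }
    } else if (ch === "'" || ch === '"') {
      quote = ch;         // Opening quote; discard whitespace seen before it.
      current = '';
      wasQuoted = true;
    } else if (ch === ',') {
      finishValue();      // A comma outside quotes ends the value.
    } else if (!(wasQuoted && /\s/.test(ch))) {
      current += ch;      // Skip whitespace that trails a quoted value.
    }
  }
  finishValue();
  return values;
}

var parts = splitQuoted("'string, duppi, du', 23, lala");
// parts is ["string, duppi, du", "23", "lala"]
```

The single `quote` variable is the whole state machine here: commas only act as separators while it is null.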
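For the purpose of this discussion, a "CSV string" consists of zero or more values, where multiple values are separated by a comma. Each value may consist of:
- A double quoted string (may contain unescaped single quotes).
- A single quoted string (may contain unescaped double quotes).
- A non-quoted string (may not contain quotes, commas, or backslashes).
- An empty value (an all-whitespace value is considered empty).

Rules/Notes:
- Quoted values may contain commas.
- Quoted values may contain escaped characters, e.g. 'that\'s cool'.
- Values containing quotes, commas, or backslashes must be quoted.
- Values with leading or trailing whitespace must be quoted.
- The backslash is removed from every \' in single quoted values.
- The backslash is removed from every \" in double quoted values.
- Non-quoted strings are trimmed of leading and trailing whitespace.
- The comma separator may have adjacent whitespace (which is ignored).

The goal is a JavaScript function which converts a valid CSV string (as defined above) into an array of string values.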
The regular expressions used by this solution are complex. And (IMHO) all non-trivial regexes should be presented in free-spacing mode with lots of comments and indentation. Unfortunately, JavaScript does not allow free-spacing mode. Thus, the regular expressions implemented by this solution are first presented in native regex syntax (expressed using Python's handy r'''...''' raw-multi-line-string syntax).
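Since JavaScript has no free-spacing mode, one possible workaround (a sketch, not something this answer itself does) is to keep each alternative as its own small regex literal and glue the pieces together through the standard .source property and the RegExp constructor. The variable names below are made up for illustration; the assembled pattern is meant to be identical to the re_value literal used by the function later on.

```javascript
// Each alternative is a complete regex literal, so it can carry its own comment.
var singleQuoted = /'([^'\\]*(?:\\[\S\s][^'\\]*)*)'/;    // $1: single quoted string
var doubleQuoted = /"([^"\\]*(?:\\[\S\s][^"\\]*)*)"/;    // $2: double quoted string
var unquoted = /([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)/;       // $3: non-comma, non-quote stuff

// Assemble the full pattern from commented string pieces.
var reValueCommented = new RegExp(
  '(?!\\s*$)' +                 // Don't match empty last value.
  '\\s*' +                      // Strip whitespace before value.
  '(?:' +                       // Group for value alternatives.
    singleQuoted.source + '|' +
    doubleQuoted.source + '|' +
    unquoted.source +
  ')' +                         // End group of value alternatives.
  '\\s*' +                      // Strip whitespace after value.
  '(?:,|$)',                    // Field ends on comma or end of string.
  'g'
);

var firstMatch = reValueCommented.exec("'a, b', 2");
// firstMatch[1] holds the text between the single quotes.
```

The trade-off is that backslashes in the plain string pieces have to be doubled, which is why the three alternatives are kept as regex literals.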
First, here is a regular expression which validates that a CSV string meets the above requirements:
    re_valid = r"""
    # Validate a CSV string having single, double or un-quoted values.
    ^                                   # Anchor to start of string.
    \s*                                 # Allow whitespace before value.
    (?:                                 # Group for value alternatives.
      '[^'\\]*(?:\\[\S\s][^'\\]*)*'     # Either Single quoted string,
    | "[^"\\]*(?:\\[\S\s][^"\\]*)*"     # or Double quoted string,
    | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*    # or Non-comma, non-quote stuff.
    )                                   # End group of value alternatives.
    \s*                                 # Allow whitespace after value.
    (?:                                 # Zero or more additional values
      ,                                 # Values separated by a comma.
      \s*                               # Allow whitespace before value.
      (?:                               # Group for value alternatives.
        '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # Either Single quoted string,
      | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # or Double quoted string,
      | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*  # or Non-comma, non-quote stuff.
      )                                 # End group of value alternatives.
      \s*                               # Allow whitespace after value.
    )*                                  # Zero or more additional values
    $                                   # Anchor to end of string.
    """
If a string matches the above regex, then that string is a valid CSV string (according to the rules previously stated) and may be parsed using the following regex. The following regex is then used to match one value from the CSV string. It is applied repeatedly until no more matches are found (and all values have been parsed).
    re_value = r"""
    # Match one value in valid CSV string.
    (?!\s*$)                            # Don't match empty last value.
    \s*                                 # Strip whitespace before value.
    (?:                                 # Group for value alternatives.
      '([^'\\]*(?:\\[\S\s][^'\\]*)*)'   # Either $1: Single quoted string,
    | "([^"\\]*(?:\\[\S\s][^"\\]*)*)"   # or $2: Double quoted string,
    | ([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)  # or $3: Non-comma, non-quote stuff.
    )                                   # End group of value alternatives.
    \s*                                 # Strip whitespace after value.
    (?:,|$)                             # Field ends on comma or EOS.
    """
Note that there is one special case value that this regex does not match - the very last value when that value is empty. This special "empty last value" case is tested for and handled by the js function which follows.
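As a rough illustration of "applied repeatedly until no more matches are found", the value regex can also be driven with an exec() loop, with the empty-last-value case handled by a separate check afterwards. This is only a sketch under the assumption that the input has already been validated; collectValues is a hypothetical name, and it deliberately uses a loop instead of the replace-with-callback approach.

```javascript
var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;

// Hypothetical helper: assumes `text` is already known to be a valid CSV string.
function collectValues(text) {
  var values = [];
  var m;
  re_value.lastIndex = 0; // Global regexes keep state between calls, so reset it.
  while ((m = re_value.exec(text)) !== null) {
    // Defensive guard against a zero-length match looping forever.
    if (m.index === re_value.lastIndex) re_value.lastIndex++;
    if (m[1] !== undefined) values.push(m[1].replace(/\\'/g, "'"));      // Single quoted.
    else if (m[2] !== undefined) values.push(m[2].replace(/\\"/g, '"')); // Double quoted.
    else values.push(m[3]);                                              // Unquoted.
  }
  // The regex never matches an empty last value, so add it by hand.
  if (/,\s*$/.test(text)) values.push('');
  return values;
}

var trailing = collectValues("one, 'two', ");
// trailing is ["one", "two", ""]: the final "" comes from the special-case check.
```

Without the last check the result would stop at "two", because the lookahead at the start of the regex refuses to match when only whitespace remains.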
    // Return array of string values, or NULL if CSV string not well formed.
    function CSVtoArray(text) {
        var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
        var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;

        // Return NULL if input string is not well formed CSV string.
        if (!re_valid.test(text)) return null;

        var a = []; // Initialize array to receive values.
        text.replace(re_value, // "Walk" the string using replace with callback.
            function(m0, m1, m2, m3) {
                // Remove backslash from \' in single quoted values.
                if (m1 !== undefined) a.push(m1.replace(/\\'/g, "'"));
                // Remove backslash from \" in double quoted values.
                else if (m2 !== undefined) a.push(m2.replace(/\\"/g, '"'));
                else if (m3 !== undefined) a.push(m3);
                return ''; // Return empty string.
            });

        // Handle special case of empty last value.
        if (/,\s*$/.test(text)) a.push('');
        return a;
    };
In the following examples, curly braces are used to delimit the {result strings}. (This is to help visualize leading/trailing spaces and zero-length strings.)
    // Test 1: Test string from original question.
    var test = "'string, duppi, du', 23, lala";
    var a = CSVtoArray(test);
    /* Array has 3 elements:
        a[0] = {string, duppi, du}
        a[1] = {23}
        a[2] = {lala} */

    // Test 2: Empty CSV string.
    var test = "";
    var a = CSVtoArray(test);
    /* Array has 0 elements: */

    // Test 3: CSV string with two empty values.
    var test = ",";
    var a = CSVtoArray(test);
    /* Array has 2 elements:
        a[0] = {}
        a[1] = {} */

    // Test 4: Double quoted CSV string having single quoted values.
    // Note the doubled backslash: inside a JavaScript string literal, \' alone
    // collapses to a plain ' and the backslash never reaches the parser.
    var test = "'one','two with escaped \\' single quote', 'three, with, commas'";
    var a = CSVtoArray(test);
    /* Array has 3 elements:
        a[0] = {one}
        a[1] = {two with escaped ' single quote}
        a[2] = {three, with, commas} */

    // Test 5: Single quoted CSV string having double quoted values.
    // The backslash is doubled here for the same reason as in Test 4.
    var test = '"one","two with escaped \\" double quote", "three, with, commas"';
    var a = CSVtoArray(test);
    /* Array has 3 elements:
        a[0] = {one}
        a[1] = {two with escaped " double quote}
        a[2] = {three, with, commas} */

    // Test 6: CSV string with whitespace in and around empty and non-empty values.
    var test = " one , 'two' , , ' four' ,, 'six ', ' seven ' , ";
    var a = CSVtoArray(test);
    /* Array has 8 elements:
        a[0] = {one}
        a[1] = {two}
        a[2] = {}
        a[3] = { four}
        a[4] = {}
        a[5] = {six }
        a[6] = { seven }
        a[7] = {} */
This solution requires that the CSV string be "valid". For example, unquoted values may not contain backslashes or quotes, e.g. the following CSV string is NOT valid:
    var invalid1 = "one, that's me!, escaped \\, comma";
This is not really a limitation because any sub-string may be represented as either a single or double quoted value. Note also that this solution represents only one possible definition for: "Comma Separated Values".
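A quick way to see both points is to run the validation regex on the invalid string and on a re-quoted version of it. This is a small sketch; the name `requoted` and the particular choice of quotes are just one possible way to rewrite the values.

```javascript
var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;

// Unquoted values containing an apostrophe and a backslash: not valid.
var invalid1 = "one, that's me!, escaped \\, comma";
var invalidResult = re_valid.test(invalid1);   // false

// The same content with the offending values wrapped in quotes: valid.
// The apostrophe goes inside double quotes, the backslash inside single quotes.
var requoted = "one, \"that's me!\", 'escaped \\, comma'";
var requotedResult = re_valid.test(requoted);  // true
```

Picking the quote type that does not occur in the value avoids having to escape anything.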
Edit: 2014-05-19: Added disclaimer. Edit: 2014-12-01: Moved disclaimer to top.