
Splitting on comma outside quotes [duplicate]

My program reads a line from a file. This line contains comma-separated text like:

123,test,444,"don't split, this",more test,1 

I would like the result of a split to be this:

123 test 444 "don't split, this" more test 1 

If I use String.split(","), I get this instead:

123 test 444 "don't split  this" more test 1 

In other words: the comma inside the substring "don't split, this" is not a separator. How can I split only on the commas outside the quotes?

Jakob Mathiasen asked Sep 19 '13 11:09




2 Answers

You can try out this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"); 

This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.

Explanation:

,           // Split on comma
(?=         // Followed by
   (?:      // Start a non-capture group
     [^"]*  // 0 or more non-quote characters
     "      // 1 quote
     [^"]*  // 0 or more non-quote characters
     "      // 1 quote
   )*       // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
   [^"]*    // Finally 0 or more non-quotes
   $        // Till the end  (This is necessary, else every comma will satisfy the condition)
)           // End look-ahead

You can even write it this way in your code by starting the regex with the (?x) modifier. The modifier makes the regex engine ignore whitespace in the pattern, so a regex broken across multiple lines becomes easier to read:

String[] arr = str.split("(?x)   " +
                     ",          " +   // Split on comma
                     "(?=        " +   // Followed by
                     "  (?:      " +   // Start a non-capture group
                     "    [^\"]* " +   // 0 or more non-quote characters
                     "    \"     " +   // 1 quote
                     "    [^\"]* " +   // 0 or more non-quote characters
                     "    \"     " +   // 1 quote
                     "  )*       " +   // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
                     "  [^\"]*   " +   // Finally 0 or more non-quotes
                     "  $        " +   // Till the end  (This is necessary, else every comma will satisfy the condition)
                     ")          "     // End look-ahead
                     );
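
To check the behaviour on the line from the question, the look-ahead split can be wrapped in a small program. This is only a sketch: the class name QuoteAwareSplit and the helper splitOutsideQuotes are made up for illustration, and the input is assumed to have balanced double quotes.

```java
class QuoteAwareSplit {
    // A comma that is followed by an even number of double quotes
    // up to the end of the input, i.e. a comma outside any quoted part.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one line on commas outside double quotes.
    static String[] splitOutsideQuotes(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        // Prints one field per line; the quoted field stays in one piece.
        for (String field : splitOutsideQuotes(line)) {
            System.out.println(field);
        }
    }
}
```

The quoted field keeps its surrounding quotes. Note also that String.split(regex) drops trailing empty strings, so a line ending in commas needs split(regex, -1) if those empty fields matter.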
Rohit Jain answered Sep 28 '22 20:09


Why Split when you can Match?

Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:

"[^"]*"|[^,]+ 

This will match all the desired fragments (see demo).

Explanation

  • With "[^"]*", we match complete "double-quoted strings"
  • or |
  • we match [^,]+, one or more characters that are not a comma.
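
In Java, matching instead of splitting means looping over Matcher.find(). The following is a minimal sketch of that idea; the class name QuoteAwareMatch and the method fields are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareMatch {
    // Either a complete double-quoted string, or a run of non-comma characters.
    private static final Pattern FIELD = Pattern.compile("\"[^\"]*\"|[^,]+");

    // Hypothetical helper: collects every match of FIELD in the line.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        // Prints one field per line; the quoted field stays in one piece.
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```

One thing to be aware of: because [^,]+ needs at least one character, empty fields such as the middle one in a,,b are skipped rather than returned as empty strings.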

A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.
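
One way such a refinement might look, assuming quotes inside a quoted string are escaped with a backslash (the answer does not say which escaping convention is meant, so this is a guess). The class name EscapedQuoteMatch and the method fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteMatch {
    // Quoted side: any mix of ordinary characters (not a quote, not a backslash)
    // and backslash-escaped characters such as \" between the quotes.
    // Assumption: the escape character is a backslash.
    private static final Pattern FIELD =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    // Hypothetical helper: collects every match of FIELD in the line.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The raw line is: 1,"say \"hi\", ok",2
        String line = "1,\"say \\\"hi\\\", ok\",2";
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```

If the data doubles its quotes instead (the usual CSV convention), the quoted side would have to change accordingly, for example to "(?:[^"]|"")*".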

zx81 answered Sep 28 '22 20:09