Regex to find commas that aren't inside "( and )"

Tags: java, regex

I need some help to model this regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside this structure: "( )", like this:

,a,b,c,d,"("x","y",z)",e,f,g,

The first five and the last four commas should match the expression; the two between x, y and z, inside the "( )" section, should not.

I tried a lot of combinations, but regular expressions are still a little foggy to me.

I want to use it with the split method in Java. The example is short, but the real input can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches it (in this case the comma) is treated as a separator.

So, I want to do something like this:

String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g

Thanks!

Alaor asked Aug 07 '10 00:08


3 Answers

You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:

String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";

String[] parts = text.split(";(?![^<>]*>)");

System.out.println(java.util.Arrays.toString(parts));
//  _  _  _  _  _______  _  _  _  _________  _  _  _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]

Note that the delimiter is now ; instead of ,, and the brackets are simply < and > instead of "( and )", but the idea still works.


On the pattern

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

The * repetition specifier matches the preceding pattern zero or more times.

The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.

The pattern [^<>]*> matches a (possibly empty) sequence of anything except < and >, followed by a closing >.

Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if the first bracket to its right is not a closing >, because seeing a closing bracket first can only mean that the ; is "inside" the brackets.
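To see the lookahead at work, here is a minimal sketch that lists the index of every ; the pattern accepts. The class and method names (LookaheadDemo, delimiterPositions) are made up for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Same pattern as above: a ';' that is NOT followed by
    // "any run of non-bracket characters, then '>'".
    static final Pattern DELIMITER = Pattern.compile(";(?![^<>]*>)");

    // Returns the index of every ';' that the pattern treats as a delimiter.
    // (Hypothetical helper, not part of the original answer.)
    static List<Integer> delimiterPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = DELIMITER.matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "a;<x;y>;b";
        // index:      012345678
        System.out.println(delimiterPositions(text)); // [1, 7]
    }
}
```

The ; at index 4 is skipped because the first bracket to its right is the closing >, so the negative lookahead fails there.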

This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.
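One possible adaptation, as a sketch rather than the definitive answer: swap ; for , and < > for ( ), escaping the closing parenthesis in the regex and doubling the backslash in the Java literal. The class and method names are invented for this example, and it assumes the sections are never nested.

```java
import java.util.Arrays;

class SplitOutsideParens {
    // Regex: ,(?![^()]*\))
    // A comma that is NOT followed by "non-paren characters, then ')'".
    static final String DELIMITER = ",(?![^()]*\\))";

    // Hypothetical helper: splits a row on commas outside ( ... ).
    static String[] splitOutsideParens(String row) {
        return row.split(DELIMITER);
    }

    public static void main(String[] args) {
        // The sample row from the question.
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        String[] keys = splitOutsideParens(row);
        System.out.println(Arrays.toString(keys));
        // [, a, b, c, d, "("x","y",z)", e, f, g]
    }
}
```

Because the sample row starts with a comma, keys[0] is the empty string and a lands in keys[1]. split drops the trailing empty string on its own.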

You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).

References

  • regular-expressions.info/Character class, Repetition, Lookarounds, Possessive
polygenelubricants answered Sep 20 '22 20:09


Try this one:

(?![^(]*\)),

It worked in my testing: it matched all commas not inside parentheses.

Edit: Gopi pointed out the need to escape the backslash in a Java string literal:

(?![^(]*\\)),

Edit: Alan Moore pointed out some unnecessary complexity. Fixed.
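For reference, a small runnable sketch of this pattern, where the lookahead sits in front of the comma instead of after it. The class name and the test string are made up for illustration; the sample has two parenthesized sections to show that more than one is handled.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CommaBeforeLookahead {
    // Regex: (?![^(]*\)),
    // The backslash is doubled inside the Java string literal.
    static final Pattern DELIMITER = Pattern.compile("(?![^(]*\\)),");

    // Hypothetical helper: splits on commas that are outside ( ... ).
    static String[] split(String row) {
        return DELIMITER.split(row);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("a,(b,c),d,(e,f),g")));
        // [a, (b,c), d, (e,f), g]
    }
}
```

Like the other lookahead approach, this assumes the parentheses are balanced and never nested.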

cnanney answered Sep 23 '22 20:09


If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.

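import java.util.ArrayList;
import java.util.List;

List<String> result = new ArrayList<String>();
// Chunks alternate: even indexes lie outside the parens, odd indexes inside.
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
  if ((i % 2) == 0) {
    // Outside the parens: split on commas as usual.
    for (String atom : chunks[i].split(","))
      result.add(atom);
  }
  else
    // Inside the parens: keep the chunk whole (the parens themselves are dropped).
    result.add(chunks[i]);
}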
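A self-contained variation on the same idea, offered as a sketch under a few assumptions: it puts the parentheses back around the inner chunks and skips the empty atoms that appear next to a paren (for example, the chunk ",e" splits into "" and "e"). Skipping empties also drops genuinely empty fields such as the one in "a,,b", and the sketch uses bare ( ) rather than the quoted "( )" form from the question. The class name ChunkSplit is invented here.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSplit {
    // Splits on commas outside parentheses.
    // Assumes parens are balanced and never nested.
    static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        String[] chunks = text.split("[()]");
        for (int i = 0; i < chunks.length; i++) {
            if (i % 2 == 0) {
                // Outside parens: ordinary comma split, skipping the empty
                // atoms left next to a paren (e.g. from "a,b," or ",e").
                for (String atom : chunks[i].split(",")) {
                    if (!atom.isEmpty()) {
                        result.add(atom);
                    }
                }
            } else {
                // Inside parens: keep the chunk whole and restore the parens.
                result.add("(" + chunks[i] + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,(x,y,z),e")); // [a, b, (x,y,z), e]
    }
}
```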
Adam Schmideg answered Sep 21 '22 20:09