
Splitting a nested string keeping quotation marks

Tags: java, string, regex

I am working on a project in Java that requires having nested strings.

For an input string that in plain text looks like this:

This is "a string" and this is "a \"nested\" string"

The result must be the following:

[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"

Note that I want the \" sequences to be kept.
I have the following method:

public static String[] splitKeepingQuotationMarks(String s);

and I need to create an array of strings out of the given s parameter by the given rules, without using the Java Collection Framework or its derivatives.

I am unsure how to solve this problem.
Can a regular expression be written that would solve it?

UPDATE based on questions from comments:

  • each unescaped " has its closing unescaped " (they are balanced)
  • each escape character \ must itself be escaped to represent a literal backslash (to get the text \ we need to write \\).
asked Mar 29 '16 18:03 by bobasti



1 Answer

You can use the following regex:

"[^"\\]*(?:\\.[^"\\]*)*"|\S+

Java demo:

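import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
// A quoted chunk (escape sequences allowed) or a run of non-whitespace characters
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0)); // one token per line
}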

Explanation:

  • "[^"\\]*(?:\\.[^"\\]*)*" - a double quote, then 0+ characters other than " and \ ([^"\\]*), then 0+ repetitions of an escape sequence (\\., a backslash plus any one character) followed by 0+ characters other than " and \, and finally the closing double quote
  • | - or...
  • \S+ - 1 or more non-whitespace characters
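
To see how the two alternatives share the work, here is a minimal sketch that runs the same pattern over a few small inputs, including an escaped backslash right before a closing quote. The class name `TokenDemo` and the helper `tokens` are hypothetical, introduced only for illustration, and the `|` separator is just a display choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {
    // Same pattern as above: a quoted chunk (escapes allowed) or a run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    // Hypothetical helper: joins all matches with '|' so they are easy to read.
    static String tokens(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain words are picked up by \S+.
        System.out.println(tokens("one two"));                  // one|two
        // A quoted chunk containing spaces is a single match.
        System.out.println(tokens("say \"hello world\" now"));  // say|"hello world"|now
        // \\ is consumed as one escaped pair, so the quote after it closes the chunk.
        System.out.println(tokens("\"a\\\\\" b"));              // "a\\"|b
    }
}
```

The last case is the one the UPDATE in the question is about: because `\\.` always consumes the backslash together with the character after it, an escaped backslash cannot be mistaken for the start of an escaped quote.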

NOTE

@Pshemo's suggestion, "\"(?:\\\\.|[^\"])*\"|\\S+" (or, more correctly, "\"(?:\\\\.|[^\"\\\\])*\"|\\S+"), matches the same tokens but is much less efficient because it quantifies an alternation group with *. With that construct the regex engine has to try the alternatives at every position, with two possibilities each time, which involves much more backtracking. My unroll-the-loop version matches whole chunks of text at once, and is thus much faster and more reliable.
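
As a quick sanity check that the two patterns agree on what they match, the sketch below runs both over the sample input and compares the tokens. It only checks equality on this one input and does not measure speed. `PatternCompare` and `tokens` are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternCompare {
    // Unrolled-loop version from the answer.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
    // Alternation-based version (the "more correct" variant mentioned above).
    static final Pattern ALTERNATION =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|\\S+");

    // Hypothetical helper: collects all matches, one per line.
    static String tokens(Pattern pattern, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            sb.append(m.group()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String unrolled = tokens(UNROLLED, str);
        String alternation = tokens(ALTERNATION, str);
        System.out.print(unrolled);
        System.out.println("Same tokens: " + unrolled.equals(alternation));
    }
}
```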

UPDATE

Since a String[] is required as output and collections are not allowed, you need two passes over the matches: first count them, then create an array of that size, reset the matcher, and run it again to fill the array:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);

// First pass: count the matches
int cnt = 0;
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);

// Second pass: allocate the array, rewind the matcher, and fill it
String[] result = new String[cnt];
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));
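
For completeness, here is one way the same two-pass idea could be wrapped in the method signature from the question. This is a sketch rather than a definitive implementation: the class name `QuoteSplitter` is hypothetical, and the output is printed with a plain loop so that nothing beyond `java.util.regex` is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteSplitter {
    // Compiled once and reused: a quoted chunk (escapes allowed) or a run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // First pass: count the tokens so the array can be sized exactly.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Second pass: rewind the matcher and copy each token into the array.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; i < count && matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String[] parts = splitKeepingQuotationMarks(str);
        // Prints the tokens in the "[index] token" layout used in the question.
        for (int i = 0; i < parts.length; i++) {
            System.out.println("[" + i + "] " + parts[i]);
        }
    }
}
```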

answered Oct 22 '22 19:10 by Wiktor Stribiżew