I am working on a project in Java that requires having nested strings.
For an input string that in plain text looks like this:
This is "a string" and this is "a \"nested\" string"
The result must be the following:
[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"
Note that I want the \" sequences to be kept.
I have the following method:
public static String[] splitKeepingQuotationMarks(String s);
and I need to create an array of strings out of the given s
parameter by the given rules, without using the Java Collection Framework or its derivatives.
I am unsure about how to solve this problem.
Can a regular expression solve this?
UPDATE based on questions from comments:
Every opening " has a closing unescaped " (the quotes are balanced).
A \ must also be escaped if we want to create a literal representing it (to create text representing \ we need to write it as \\).
You can use the following regex:
"[^"\\]*(?:\\.[^"\\]*)*"|\S+
See the regex demo
Java demo:
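import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// Print every token: a quoted chunk or a run of non-whitespace characters
while (matcher.find()) {
    System.out.println(matcher.group(0));
}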
Explanation:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quote followed by 0+ characters other than " and \ ([^"\\]), then 0+ repetitions of an escape sequence (\\.) followed by 0+ characters other than " and \, and finally the closing double quote
| - or...
\S+ - 1 or more non-whitespace characters
NOTE
@Pshemo's suggestion - "\"(?:\\\\.|[^\"])*\"|\\S+" (or, more correctly, "\"(?:\\\\.|[^\"\\\\])*\"|\\S+") - is effectively the same expression, but much less efficient since it uses an alternation group quantified with *. This construct involves much more backtracking, as the regex engine has to test each position and there are 2 possibilities at each one. My unroll-the-loop version matches chunks of text at once, and is thus much faster and more reliable.
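As a quick sanity check of the note above, here is a minimal sketch that runs both patterns over the sample input and compares the tokens they produce. The class name PatternComparison and the helper tokens are made up for illustration. The sketch only checks that the results agree on this one input; it does not measure the performance difference.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternComparison {
    // Unrolled-loop pattern from the answer
    static final String UNROLLED = "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+";
    // Alternation-based variant (the "more correct" form of @Pshemo's suggestion)
    static final String ALTERNATION = "\"(?:\\\\.|[^\"\\\\])*\"|\\S+";

    // Hypothetical helper: collects all matches of regex in input into an array
    static String[] tokens(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;                      // first pass: count the matches
        }
        String[] out = new String[n];
        m.reset();                    // rewind to the start of the input
        for (int i = 0; m.find(); i++) {
            out[i] = m.group();       // second pass: store each match
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String[] a = tokens(UNROLLED, str);
        String[] b = tokens(ALTERNATION, str);
        System.out.println(Arrays.toString(a));
        System.out.println("Same tokens: " + Arrays.equals(a, b));
    }
}
```

Agreement on a single input does not prove the patterns are equivalent in general; it is only a check on the example from the question.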
UPDATE
Since a String[] is required as output, you need two passes over the input: count the matches, create the array, and then reset the matcher and run it again to fill the array:
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

int cnt = 0;
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// First pass: count the matches so the array can be sized
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);
String[] result = new String[cnt];
// Second pass: rewind the matcher and store each match
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));
See another IDEONE demo
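For completeness, the two-pass approach can be wrapped in the method signature from the question. This is only a sketch: the class name QuotedSplitter and the constant TOKEN are invented for illustration, and it assumes the quotes are balanced, as stated in the question's update. It uses only java.util.regex and plain arrays.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedSplitter {
    // Same pattern as in the answer: a quoted chunk that may contain
    // backslash escapes, or a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // First pass: count the matches so the array can be sized.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Second pass: rewind and copy each match into the array.
        String[] result = new String[count];
        matcher.reset();
        int idx = 0;
        while (matcher.find()) {
            result[idx++] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String[] parts = splitKeepingQuotationMarks(str);
        for (int i = 0; i < parts.length; i++) {
            System.out.println("[" + i + "] " + parts[i]);
        }
    }
}
```

Compiling the pattern once as a static constant avoids recompiling it on every call. If the input may contain unbalanced quotes, a stray " falls through to the \S+ branch and is matched together with the non-whitespace characters that follow it.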