I have a language that defines a string as being delimited by either single or double quotes, where the delimiter is escaped within the string by doubling it. For example, all of the following are legal strings:
'This isn''t easy to parse.'
'Then John said, "Hello Tim!"'
"This isn't easy to parse."
"Then John said, ""Hello Tim!"""
I have a collection of strings (as defined above), separated by delimiters that don't contain a quote. Using regular expressions, I am trying to extract each string from such a list. For example, here is an input:
"Some String #1" OR 'Some String #2' AND "Some 'String' #3" XOR
'Some "String" #4' HOWDY "Some ""String"" #5" FOO 'Some ''String'' #6'
The regular expression to determine whether a string is of such a form is trivial:
^(?:"(?:[^"]|"")*"|'(?:[^']|'')*')(?:\s+[^"'\s]+\s+(?:"(?:[^"]|"")*"|'(?:[^']|'')*'))*
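A minimal sketch of how this validation step might be run from Java. It assumes the pattern is meant to cover the whole input (so it uses `matches()` instead of explicit anchors) and that the repeated group is closed before the trailing `*`. The names `QuotedListValidator` and `isValid` are hypothetical.

```java
import java.util.regex.Pattern;

public class QuotedListValidator {
    // One quoted string: double-quoted with "" as the escape,
    // or single-quoted with '' as the escape.
    private static final String QUOTED = "(?:\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*')";

    // A quoted string followed by zero or more (separator, quoted string) pairs.
    private static final Pattern LIST =
            Pattern.compile(QUOTED + "(?:\\s+[^\"'\\s]+\\s+" + QUOTED + ")*");

    // matches() must consume the entire input, so no ^ or $ is needed here.
    public static boolean isValid(String input) {
        return LIST.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("\"Some String #1\" OR 'Some String #2'"));
        System.out.println(isValid("\"Some String #1\" OR"));
    }
}
```

Building the pattern from a shared `QUOTED` fragment keeps the two copies of the quoted-string sub-pattern from drifting apart.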
After running the above expression to test whether it is of such a form, I need another regular expression to get each delimited string from the input. I plan to do this as follows:
Pattern pattern = Pattern.compile("What REGEX goes here?");
Matcher matcher = pattern.matcher(inputString);
int startIndex = 0;
while (matcher.find(startIndex))
{
    String quote = matcher.group(1);
    String quotedString = matcher.group(2);
    ...
    startIndex = matcher.end();
}
I would like a regular expression that captures the quote character in group #1, and the text within quotes in group #2 (I am using Java Regex). So, for the above input, I am looking for a regular expression that produces the following output within each loop iteration:
Loop 1: matcher.group(1) = "
matcher.group(2) = Some String #1
Loop 2: matcher.group(1) = '
matcher.group(2) = Some String #2
Loop 3: matcher.group(1) = "
matcher.group(2) = Some 'String' #3
Loop 4: matcher.group(1) = '
matcher.group(2) = Some "String" #4
Loop 5: matcher.group(1) = "
matcher.group(2) = Some ""String"" #5
Loop 6: matcher.group(1) = '
matcher.group(2) = Some ''String'' #6
Patterns I have tried thus far (un-escaped, followed by escaped for Java code):
(["'])((?:[^\1]|\1\1)*)\1
"([\"'])((?:[^\\1]|\\1\\1)*)\\1"
(?<quot>")(?<val>(?:[^"]|"")*)"|(?<quot>')(?<val>(?:[^']|'')*)'
"(?<quot>\")(?<val>(?:[^\"]|\"\")*)\"|(?<quot>')(?<val>(?:[^']|'')*)'"
Both of these fail when trying to compile the pattern.
Is such a regular expression possible?
Firstly, the double quote character is nothing special in regex. It's just another character, so it doesn't need escaping from the perspective of the regex engine. However, because Java uses double quotes to delimit String constants, you must escape a double quote with a backslash whenever you write one inside a Java string literal.
To match a character that has special meaning in regex, you need to use an escape sequence prefixed with a backslash ( \ ). E.g., regex \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (backslash).
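The escapes listed above can be sketched in Java as follows. Because each pattern is written as a Java string literal, every regex backslash has to be doubled in the source. The class and method names are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    // Regex \. written as the Java literal "\\."
    public static boolean matchesDot(String s) {
        return Pattern.matches("\\.", s);
    }

    // Regex \+ written as the Java literal "\\+"
    public static boolean matchesPlus(String s) {
        return Pattern.matches("\\+", s);
    }

    // Regex \\ (one literal backslash) written as the Java literal "\\\\"
    public static boolean matchesBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesDot("."));
        System.out.println(matchesDot("a"));
        System.out.println(matchesPlus("+"));
        System.out.println(matchesBackslash("\\"));
        // Pattern.quote escapes an arbitrary literal without hand-written backslashes.
        System.out.println(Pattern.matches(Pattern.quote("("), "("));
    }
}
```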
In a Java string literal, try putting a backslash ( \ ) in front of the " .
If you need a double quote inside a Java string literal, escape it with a backslash. A single quote can be used in a string literal without a backslash.
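A short illustration of the two cases, reusing sentences from the question. The class name `QuoteLiteralDemo` and its members are hypothetical.

```java
import java.util.regex.Pattern;

public class QuoteLiteralDemo {
    // The backslash is consumed by the Java compiler; the string itself contains a bare ".
    public static final String DOUBLE_QUOTED = "Then John said, \"Hello Tim!\"";

    // A single quote needs no escaping inside a Java string literal.
    public static final String SINGLE_QUOTED = "This isn't easy to parse.";

    // The regex engine receives the one-character pattern " and treats it as a literal.
    public static boolean containsDoubleQuote(String s) {
        return Pattern.compile("\"").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(DOUBLE_QUOTED);
        System.out.println(containsDoubleQuote(DOUBLE_QUOTED));
        System.out.println(containsDoubleQuote(SINGLE_QUOTED));
    }
}
```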
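Make a utility class that matches for you:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class test {
    // In both patterns, group 1 is the quote character and group 2 is the text between the quotes.
    private static Pattern pd = Pattern.compile("(\")((?:[^\"]|\"\")*)\"");
    private static Pattern ps = Pattern.compile("(')((?:[^']|'')*)'");

    // If all of s is a double-quoted string, returns the matcher that matched it.
    // Otherwise returns a fresh single-quote matcher; the caller must still call matches() on it.
    public static Matcher match(String s) {
        Matcher md = pd.matcher(s);
        if (md.matches()) return md;
        else return ps.matcher(s);
    }
}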
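As an alternative to two separate patterns, here is a hedged sketch of a single pattern that fits the loop in the question. It relies on two facts about Java regex: a backreference cannot be used inside a character class such as [^\1], and a named group cannot be defined twice in one pattern. The sketch therefore swaps the character class for a negative lookahead, (?!\1). , which means "any character that is not the opening quote". The names `QuotedStringScanner` and `scan` are hypothetical, and `Pattern.DOTALL` is an assumption that quoted strings may span lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStringScanner {
    // Group 1: the opening quote.
    // Group 2: any run of "a character that is not that quote" or "that quote doubled".
    // The final \1 requires the same quote character to close the string.
    private static final Pattern QUOTED =
            Pattern.compile("([\"'])((?:(?!\\1).|\\1\\1)*)\\1", Pattern.DOTALL);

    // Returns {quote, text} pairs in input order; the text keeps its doubled quotes.
    public static List<String[]> scan(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        // find() with no argument resumes after the previous match.
        while (matcher.find()) {
            result.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "\"Some String #1\" OR 'Some ''String'' #6'";
        for (String[] pair : scan(input)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

If the unescaped text is wanted, one option is to collapse the doubled delimiter afterwards, for example `text.replace(quote + quote, quote)`.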