I am making a key-value parser where the input string takes the form key:"value",key2:"value". Keys can contain the characters a-z, A-Z and 0-9. Values can contain any character, but : , " and \ (colon, comma, double quote and backslash) need to be prefixed with a backslash. Commas separate the key-value pairs but are not needed after the last pair.
So far I have ([a-zA-Z0-9]+):"(.*)" which will match most keys and values, but obviously it won't be able to handle more than a single pair, or cases where any of the 'control' characters go unescaped. (?<=\\)[:,"\\] seems to match all escaped characters, but it will not match any 'normal' characters.
Is there a way to check for comma separation and to match all escaped 'control' characters as well as normal ones? Is this something that would be better suited to implementation without regex or would this need multiple patterns in sequence?
Some examples:
input: joe:"bread",sam:"fish"
output:
joe -> bread
sam -> fish
input: joe:"Look over there\, it's a shark!",sam:"I like fish."
output:
joe -> Look over there, it's a shark!
sam -> I like fish.
You could use either of the regexes below to get the key-value pairs.
([a-zA-Z0-9]+):"(.*?)(?<!\\)"
OR
([a-zA-Z0-9]+):"(.*?)"(?=,[a-zA-Z0-9]+:"|$)
As a Java string literal, the first regex would be:
"([a-zA-Z0-9]+):\"(.*?)(?<!\\\\)\""
(?<!\\)" is a negative lookbehind followed by the quote: it asserts that the double quote is not preceded by a backslash character. In a Java string literal, matching one backslash takes four backslashes, \\\\: the compiler turns them into \\, which the regex engine reads as one escaped (literal) backslash.
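To see the two escaping layers side by side, here is a minimal sketch. The class name BackslashLevels and its helper methods are made up for illustration; it only checks how long the compiled literal is and that it matches a single backslash.

```java
import java.util.regex.Pattern;

public class BackslashLevels {
    // Java source literal written with four backslashes; the compiler stores two characters.
    static final String REGEX_FOR_ONE_BACKSLASH = "\\\\";

    // Number of characters the regex engine actually receives.
    public static int regexLength() {
        return REGEX_FOR_ONE_BACKSLASH.length();
    }

    // True if the regex matches a string holding exactly one backslash.
    public static boolean matchesSingleBackslash() {
        return Pattern.matches(REGEX_FOR_ONE_BACKSLASH, "\\");
    }

    public static void main(String[] args) {
        System.out.println(regexLength());
        System.out.println(matchesSingleBackslash());
    }
}
```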
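// Needs java.util.regex.Matcher and java.util.regex.Pattern
String s = "joe:\"Look over there\\, it's a shark!\",sam:\"I like fish.\"";
Matcher m = Pattern.compile("([a-zA-Z0-9]+):\"(.*?)(?<!\\\\)\"").matcher(s);
while (m.find()) {
    // group(1) is the key, group(2) the raw value (escapes are not removed)
    System.out.println(m.group(1) + " --> " + m.group(2));
}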
Output:
joe --> Look over there\, it's a shark!
sam --> I like fish.
OR
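// Needs java.util.regex.Matcher and java.util.regex.Pattern
String s = "joe:\"Look over there\\, i\\\"t's a shark!\",sam:\"I like fish.\"";
Matcher m = Pattern.compile("([a-zA-Z0-9]+):\"((?:\\\\\"|[^\"])*)\"").matcher(s);
while (m.find()) {
    // The value is either an escaped quote (\") or any character that is not a quote
    System.out.println(m.group(1) + " --> " + m.group(2));
}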
Output:
joe --> Look over there\, i\"t's a shark!
sam --> I like fish.
This answer assumes that a \ followed by any character other than a line terminator stands for the character immediately following it.
Under that assumption, you can use the following regex to match all instances of key-value pairs:
"([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\""
Add \\s* before and after : if you want to allow free spacing.
This is what the regex engine sees:
([a-zA-Z0-9]+):"((?:[^\\"]|\\.)*+)"
The quantifier * is made possessive (*+) because the two branches [^\\"] and \\. are mutually exclusive (no string can be matched by both at the same time), so backtracking into the loop could never produce a different match. It also avoids a StackOverflowError in Oracle's implementation of the Pattern class.
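As a rough illustration of the possessive loop on a long value, here is a small sketch. The class name PossessiveDemo, the helper firstRawValue and the 20,000-character stress input are all made up for this example. It only shows the possessive pattern handling such a value, and makes no claim about where a non-possessive version would start to fail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Same pattern as above: the value loop uses the possessive quantifier *+
    static final Pattern PAIR =
        Pattern.compile("([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");

    // Returns the raw (still escaped) value of the first pair, or null if none is found.
    public static String firstRawValue(String input) {
        Matcher m = PAIR.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical stress input: one pair whose value is 20,000 characters long.
        StringBuilder sb = new StringBuilder("big:\"");
        for (int i = 0; i < 20_000; i++) {
            sb.append('x');
        }
        sb.append('"');
        System.out.println(firstRawValue(sb.toString()).length());
    }
}
```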
Use the regex above in a Matcher loop:
Pattern keyValuePattern = Pattern.compile("([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");
Matcher matcher = keyValuePattern.matcher(inputString);
while (matcher.find()) {
    String key = matcher.group(1);
    // Process the escape sequences in the value string
    String value = matcher.group(2).replaceAll("\\\\(.)", "$1");
    // ...
}
In the general case, depending on the complexity of the escape sequences (e.g. \n, \uhhhh, \xhh, \0), you might want to write a separate function to parse them. However, with the assumption above, the one-liner suffices.
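A separate function of that kind might look like the sketch below. The escape set is hypothetical and goes beyond what the question asks for: it treats \n and \t as newline and tab, and lets every other escaped character stand for itself. The class and method names are made up as well.

```java
public class ValueUnescaper {
    // Hypothetical escape set: backslash-n and backslash-t are special,
    // any other escaped character stands for itself.
    public static String unescape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 == raw.length()) {
                out.append(c); // ordinary character, or a lone trailing backslash
                continue;
            }
            char escaped = raw.charAt(++i); // consume the character after the backslash
            switch (escaped) {
                case 'n':
                    out.append('\n');
                    break;
                case 't':
                    out.append('\t');
                    break;
                default:
                    out.append(escaped); // e.g. an escaped comma, colon, quote or backslash
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Look over there\\, it's a shark!"));
    }
}
```

It would be called on matcher.group(2) in place of the replaceAll one-liner.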
Note that this solution doesn't care about the separators, though, and on invalid input it will skip ahead to the nearest match. In the example of invalid input below, the solution above will match abc:"xyz:" as a pair (an unescaped : is accepted inside a value), skip over text text", and then happily match more:"pair" as a key-value pair:
abc:"xyz:"text text", more:"pair"
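The skipping can be observed with a short sketch. The class name, the helper method and the malformed sample string are made up for illustration, and the sample is a different one from the input above: two pairs preceded by junk and separated by a semicolon instead of a comma, both of which the unanchored pattern still reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientScanDemo {
    static final Pattern PAIR =
        Pattern.compile("([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");

    // Collects every key the unanchored pattern finds, ignoring whatever lies between matches.
    public static List<String> keys(String input) {
        List<String> keys = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            keys.add(m.group(1));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Malformed on purpose: leading junk and a semicolon instead of a comma.
        System.out.println(keys("??joe:\"bread\";sam:\"fish\""));
    }
}
```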
If this behavior is not desirable, there is a solution, but the string containing all the key-value pairs must be isolated first, instead of being part of a bigger string that doesn't have anything to do with key-value pairs:
"(?:^|(?!^)\\G,)([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\""
Free-spacing version (note that \s must be written as \\s inside a Java string literal):
"(?:^\\s*|(?!^)\\G\\s*,\\s*)([a-zA-Z0-9]+)\\s*:\\s*\"((?:[^\\\\\"]|\\\\.)*+)\""
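One way to use the \G-anchored pattern (the version without free spacing) for validation is sketched below. The class name StrictKeyValueParser and its parse method are hypothetical. The sketch assumes that the whole input must consist of key-value pairs, and it treats a trailing comma as an error. After the loop it compares the end of the last match with the input length to detect anything the pattern refused to match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictKeyValueParser {
    // Each match must start at the beginning of the input, or at a comma
    // directly after the previous match (\G).
    private static final Pattern PAIR = Pattern.compile(
        "(?:^|(?!^)\\G,)([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");

    // Returns the pairs in input order; throws if any part of the input is left unmatched.
    public static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        int end = 0;
        while (m.find()) {
            // Drop the backslash from every escape sequence in the value
            pairs.put(m.group(1), m.group(2).replaceAll("\\\\(.)", "$1"));
            end = m.end();
        }
        if (end != input.length()) {
            throw new IllegalArgumentException("Unparseable input at index " + end);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("joe:\"Look over there\\, it's a shark!\",sam:\"I like fish.\""));
    }
}
```

A later duplicate key overwrites the earlier value in this sketch; collect into a list instead if duplicates should be kept.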