I'm implementing some kind of parser and I need to locate and deserialize json object embedded into other semi-structured data. I used regexp:
\\{\\s*title.*?\\}
to locate object
{title:'Title'}
but it doesn't work with nested objects because expression matches only first found closing curly bracket. For
{title:'Title',{data:'Data'}}
it matches
{title:'Title',{data:'Data'}
so string becomes invalid for deserialization. I understand that there's a greedy business coming into account but I'm not familiar with regexps. Could you please help me to extend expression to consume all available closing curly brackets.
Update:
To be clear, this is an attempt to extract JSON data from semi-structured data like HTML+JS with embedded JSON. I'm using GSon JAVA lib to actually parse extracted JSON.
This recursive Perl/PCRE regular expression should be able to match any valid JSON or JSON5 object, including nested objects and edge cases such as braces inside JSON strings or JSON5 comments:
/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/
Of course, that's a bit hard to read, so you might prefer the commented version:
m{
( # Begin capture group (matching a JSON object).
\{ # Match opening brace for JSON object.
(?: # Begin non-capturing group to contain alternations.
(?>[^{}"'\/]+) # Match a non-empty string which contains no braces, quotes or slashes, without backtracking.
| # Alternation; next alternative follows.
(?>"(?:(?>[^\\"]+)|\\.)*") # Match a double-quoted JSON string, without backtracking.
| # Alternation; next alternative follows.
(?>'(?:(?>[^\\']+)|\\.)*') # Match a single-quoted JSON5 string, without backtracking.
| # Alternation; next alternative follows.
(?>\/\/.*\n) # Match a single-line JSON5 comment, without backtracking.
| # Alternation; next alternative follows.
(?>\/\*.*?\*\/) # Match a multi-line JSON5 comment, without backtracking.
| # Alternation; next alternative follows.
(?-1) # Recurse to most recent capture group, to match a nested JSON object.
)* # End of non-capturing group; match zero or more repetitions of this group.
\} # Match closing brace for JSON object.
) # End of capture group (matching a JSON object).
}x
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With