Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to match nested json objects

Tags:

java

regex

I'm implementing some kind of parser and I need to locate and deserialize json object embedded into other semi-structured data. I used regexp:

\\{\\s*title.*?\\}

to locate object

{title:'Title'}

but it doesn't work with nested objects because expression matches only first found closing curly bracket. For

{title:'Title',{data:'Data'}}

it matches

{title:'Title',{data:'Data'}

so string becomes invalid for deserialization. I understand that there's a greedy business coming into account but I'm not familiar with regexps. Could you please help me to extend expression to consume all available closing curly brackets.

Update:

To be clear, this is an attempt to extract JSON data from semi-structured data like HTML+JS with embedded JSON. I'm using GSon JAVA lib to actually parse extracted JSON.

like image 813
Viktor Stolbin Avatar asked Jul 22 '13 11:07

Viktor Stolbin


1 Answers

This recursive Perl/PCRE regular expression should be able to match any valid JSON or JSON5 object, including nested objects and edge cases such as braces inside JSON strings or JSON5 comments:

/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/

Of course, that's a bit hard to read, so you might prefer the commented version:

m{
  (                               # Begin capture group (matching a JSON object).
    \{                              # Match opening brace for JSON object.
    (?:                             # Begin non-capturing group to contain alternations.
      (?>[^{}"'\/]+)                  # Match a non-empty string which contains no braces, quotes or slashes, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>"(?:(?>[^\\"]+)|\\.)*")      # Match a double-quoted JSON string, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>'(?:(?>[^\\']+)|\\.)*')      # Match a single-quoted JSON5 string, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>\/\/.*\n)                    # Match a single-line JSON5 comment, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>\/\*.*?\*\/)                 # Match a multi-line JSON5 comment, without backtracking.
    |                               # Alternation; next alternative follows.
      (?-1)                           # Recurse to most recent capture group, to match a nested JSON object.
    )*                              # End of non-capturing group; match zero or more repetitions of this group.
    \}                              # Match closing brace for JSON object.
  )                               # End of capture group (matching a JSON object).
}x
like image 150
Deven T. Corzine Avatar answered Oct 05 '22 01:10

Deven T. Corzine