Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression to parse an array of JSON objects?

Tags:

json

c#

.net

regex

I'm trying to parse an array of JSON objects into an array of strings in C#. I can extract the array from the JSON object, but I can't split the array string into an array of individual objects.

What I have is this test string:

string json = "{items:[{id:0,name:\"Lorem Ipsum\"},{id:1,name" 
            + ":\"Lorem Ipsum\"},{id:2,name:\"Lorem Ipsum\"}]}";

Right now I'm using the following regular expressions right now to split the items into individual objects. For now they're 2 separate regular expressions until I fix the problem with the second one:

Regex arrayFinder = new Regex(@"\{items:\[(?<items>[^\]]*)\]\}"
                                 , RegexOptions.ExplicitCapture);
Regex arrayParser = new Regex(@"((?<items>\{[^\}]\}),?)+"
                                 , RegexOptions.ExplicitCapture);

The arrayFinder regex works the way I'd expect it but, for reasons I don't understand, the arrayParser regex doesn't work at all. All I want it to do is split the individual items into their own strings so I get a list like this:

{id:0,name:"Lorem Ipsum"}
{id:1,name:"Lorem Ipsum"}
{id:2,name:"Lorem Ipsum"}

Whether this list is a string[] array or a Group or Match collection doesn't matter, but I'm stumped as to how to get the objects split. Using the arrayParser and the json string declared above, I've tried this code which I assumed would work with no luck:

string json = "{items:[{id:0,name:\"Lorem Ipsum\"},{id:1,name" 
            + ":\"Lorem Ipsum\"},{id:2,name:\"Lorem Ipsum\"}]}";

Regex arrayFinder = new Regex(@"\{items:\[(?<items>[^\]]*)\]\}"
                                 , RegexOptions.ExplicitCapture);
Regex arrayParser = new Regex(@"((?<items>\{[^\}]\}),?)+"
                                 , RegexOptions.ExplicitCapture);

string array = arrayFinder.Match(json).Groups["items"].Value;
// At this point the 'array' variable contains: 
// {id:0,name:"Lorem Ipsum"},{id:1,name:"Lorem Ipsum"},{id:2,name:"Lorem Ipsum"}

// I would have expected one of these 2 lines to return 
// the array of matches I'm looking for
CaptureCollection c = arrayParser.Match(array).Captures;
GroupCollection g = arrayParser.Match(array).Groups;

Can anybody see what it is I'm doing wrong? I'm totally stuck on this.

like image 710
Dan Herbert Avatar asked Jan 03 '09 03:01

Dan Herbert


1 Answers

Balanced parentheses are literally a textbook example of a language that cannot be processed with regular expressions. JSON is essentially balanced parentheses plus a bunch of other stuff, with the braces replaced by parens. In the hierarchy of formal languages, JSON is a context-free language. Regular expressions can't parse context-free languages.

Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they're all ugly hacks, they're all unportable, and they're all ultimately the wrong tool for the job.

In professional work, you would almost always use an existing JSON parser. If you want to roll your own for educational purposes then I'd suggest starting with a simple arithmetic grammar that supports + - * / ( ). (JSON has some escaping rules which, while not complex, will make your first attempt harder than it needs to be.) Basically, you'll need to:

  1. Decompose the language into an alphabet of symbols
  2. Write a context-free grammar in terms of those symbols thatrecognizes the language
  3. Convert the grammar into Chomsky normal form, or near enough to make step 5 easy
  4. Write a lexer that converts raw text into your input alphabet
  5. Write a recursive descent parser that takes your lexer's output, parses it, and produces some kind of output

This is a typical third-year CS assignment at just about any university.

The next step is to find out how complex a JSON string you need to trigger a stack overflow in your recursive parser. Then look at the other types of parsers that can be written, and you'll understand why anyone who has to parse a context-free language in the real world uses a tool like yacc or antlr instead of writing a parser by hand.

If that's more learning than you were looking for then you should feel free to go use an off-the-shelf JSON parser, satisified that you learned something important and useful: the limits of regular expressions.

like image 147
user51099 Avatar answered Nov 14 '22 23:11

user51099