Regex/token/rule to match nested curly braces?

Tags: grammar, raku

I need to match the values of key = value pairs in BibTeX files. The values are delimited by braces and can themselves contain arbitrarily nested braces. I've got as far as matching braces nested at most two deep, like {some {stuff} like {this}}, with the kludgey:

    token brace-value {
        '{' <-[{}]>* ['{' <-[}]>* '}' <-[{}]>* ]* '}'
    }

I shudder at the idea of going one level further down... but proper parsing of my BibTeX stuff needs at least three levels deep.

Yes, I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile. My *.bib files are rather tame (and I wouldn't mind handling a few stray entries by hand); the problem is that I have a lot of them, with much overlap. But some of the "same" entries have different keys, or extra data. I want to consolidate them into a few master files (the whole idea behind BibTeX, right?). Not fun by hand if bibtool gives a file with no duplicates (ha!) of some 20 thousand lines...

asked Jul 17 '20 03:07

vonbrand

2 Answers

After perusing Lenz' "Parsing with Perl 6 Regexes and Grammars" (Apress, 2017), I realized the "regex" machinery (based on backtracking) might actually be a lot more capable than officially admitted, as a regex can call another, and nowhere do I see a prohibition on recursive calls.

Before digging in, a bit about context-free grammars: one way to describe nested braces (and nothing else) is with the grammar:

    S -> { S } S | <nothing>

I.e., nested braces are either an opening brace, nested braces, a closing brace, more nested braces; or nothing whatsoever. This translates more or less directly to Raku (there is no empty regex, fake it by making the construction optional):

    my regex nb {
        [ '{' <nb> '}' <nb> ]?
    }

Lo and behold, this works. It still needs some fixing up: avoid captures, kill backtracking (if it doesn't match on the first try, it won't ever match), and decorate with "anything else" fillers for the text between the braces.

    my regex nested-braces {
        :ratchet                                        # no backtracking
        <-[{}]>*                                        # filler: anything but braces
        [ '{' <.nested-braces> '}' <.nested-braces> ]?  # non-capturing recursive calls
        <-[{}]>*                                        # trailing filler
    };

This checks out with my test cases.
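A minimal sketch of such a check, repeating the regex so it runs on its own. The test strings are my own assumptions, not the original test cases; the anchors `^` and `$` force the whole string to be consumed.

```raku
my regex nested-braces {
    :ratchet
    <-[{}]>*
    [ '{' <.nested-braces> '}' <.nested-braces> ]?
    <-[{}]>*
}

# Hypothetical test strings: two balanced, one with a missing closing brace.
for '{some {stuff} like {this}}', '{a {b {c} d} e}', '{unbalanced {oops}' -> $s {
    say $s, ' => ', so $s ~~ / ^ <nested-braces> $ /;
}
# {some {stuff} like {this}} => True
# {a {b {c} d} e} => True
# {unbalanced {oops} => False
```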

For not-so-adventurous souls, there is the Text::Balanced module for Perl (formerly Perl 5, callable from Raku using Inline::Perl5). Not directly useful to me inside a grammar, unfortunately.

answered Oct 16 '22 10:10

vonbrand

Solution

> A way to describe nested braces (and nothing else)

Presuming a rule named &R, I'd likely write the following pattern if I was writing a quick small one-off script:

    \{ <&R>* \}

If I was writing a larger program that should be maintainable I'd likely be writing a grammar and, using a rule named R, the pattern would be:

    '{' ~ '}' <R>*

This latter avoids leaning toothpick syndrome and uses the regex ~ operator.

These will both parse arbitrarily deeply nested paired braces, e.g.:

    say '{{{{}}}}' ~~ token { \{ <&?ROUTINE>* \} } # ｢{{{{}}}}｣

(&?ROUTINE refers to the routine in which it appears. A regex is a routine, though you can't use <&?ROUTINE> in a regex declared with / ... / syntax.)
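For braces that also contain ordinary text, as BibTeX values do, the `~` form can be sketched roughly as follows. The grammar name `Braces` and the token name `braced` are hypothetical stand-ins for R, and the filler `<-[{}]>+` is my assumption about what counts as "anything else".

```raku
grammar Braces {
    token TOP    { <braced> }
    # '{' ~ '}' X  reads as: an opening brace, then X, then the closing brace.
    token braced { '{' ~ '}' [ <-[{}]>+ || <braced> ]* }
}

say so Braces.parse('{some {stuff} like {this}}');   # True
say so Braces.parse('{a {b {c}}}');                  # True
say so Braces.parse('{}}');                          # False: stray closing brace
```

When the closing delimiter is missing altogether, the `~` construct calls the grammar's FAILGOAL method, which a grammar can override to report the error.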

`regex` vs `token`

> kill backtracking
>
>     my regex nested-braces {
>         :ratchet
The only difference between patterns declared with regex and token is that the former turns ratcheting off. So using it and then immediately turning ratcheting on is notably unidiomatic. Instead:

    my token nested-braces {
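A small sketch of the difference, using made-up names and a pattern that only matches if `\w+` gives back the final character:

```raku
# All three names are hypothetical; the patterns are identical apart from ratcheting.
my regex greedy-regex {
    ^ \w+ 'b' $          # backtracking allowed
}
my token greedy-token {
    ^ \w+ 'b' $          # ratcheting: \w+ never gives characters back
}
my regex ratchet-regex {
    :ratchet
    ^ \w+ 'b' $          # regex plus :ratchet behaves like the token
}

say so 'aaab' ~~ / <greedy-regex>  /;   # True
say so 'aaab' ~~ / <greedy-token>  /;   # False
say so 'aaab' ~~ / <ratchet-regex> /;   # False
```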

Backtracking

> the "regex" machinery (based on backtracking)

The grammar/regex engine does include backtracking as an optional feature because that's occasionally exactly what one wants.

But the engine is not "based on backtracking", and many grammars/parsers make little or no use of backtracking.

Recursion

> a regex can call another, and nowhere do I see a prohibition on recursive calls.

This alone is nothing special for contemporary regex engines.

PCRE has supported recursion since 2000, and named regexes since 2003. Perl's default regex engine has supported both since 2007.

Their support for deeper levels of recursion and more named regexes being stored at once has been increasing over time.

Damian Conway's PPR uses these features of regexes to build non-trivial (but still small) parse trees.

Capabilities

> a lot more capable

Raku "regexes" can be viewed as a cleaned up take on the unfolding regex evolution. To the degree this helps someone understand them, great.

But really, it's a whole new deal. For example, they're Turing complete, in a sensible way, and thus able to parse anything.

> than officially admitted

Well that's an odd thing to say! Raku's Grammars are frequently touted as one of Raku's most innovative features.

There are three major caveats:

Performance: The primary current caveat is that a well written C parser will blow the socks off a well written Raku Grammar based parser.
Pay off: It's often not worth the effort it takes to write a fully correct parser for a non-trivial format if there's an existing parser.
Left recursion: Raku does not automatically rewrite left-recursive rules, so a rule that calls itself before consuming any input recurses endlessly.
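To illustrate the left recursion caveat, here is a sketch of rewriting a left-recursive list rule by hand. The grammar and its rule names are hypothetical; the left-recursive form is shown only in a comment because it would never terminate.

```raku
grammar Numbers {
    # Left-recursive form, NOT usable as written:
    #     token num-list { <num-list> ',' <num> | <num> }
    # It calls itself before consuming any input.
    token TOP      { <num-list> }
    # Rewritten with a quantifier and the % separator modifier instead.
    token num-list { <num>+ % ',' }
    token num      { \d+ }
}

my $m = Numbers.parse('1,22,333');
say $m<num-list><num>.map(~*).join(' ');   # 1 22 333
```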

Using existing parsers

> I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile.

Using a foreign module in Raku can be a bit of a revelation. It is not necessarily like anything you'll have experienced before. Raku's foreign language adaptors can do smart marshaling for you so it can be like you're using native Raku features.

Two of the available foreign language adaptors are already sufficiently polished to be amazing -- the ones for Perl and for C.

I'm pretty sure there's a BibTeX package for Perl that wraps a C BibTeX parser. If you used that you'd hopefully get parsing results all nicely wrapped up into Raku objects as if it was all Raku in the first place, but retaining much of the high performance of the C code.

A Raku BibTeX Grammar?

Perhaps your needs do call for creating and using a small Raku Grammar.

(Maybe you're doing this partly as an exercise to familiarize yourself with Raku, or the regex/grammar aspect of Raku. For that it sounds pretty ideal.)

As soon as you begin to use multiple regexes together -- even just two -- you are closing in on grammar territory. After all, grammars are just an easy-to-use construct for using multiple regexes together.

So if you decide you want to stick with writing parsing code in Raku, expect to write it something like this:

    grammar BiBTeX {
      token TOP { ... }
      token ...
      token ...
    }
    BiBTeX.parse: my-bib-file

For more details, see the official doc's Grammar tutorial or read Moritz's book.
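As a rough illustration only, a filled-in sketch for tame input might look like the following. Every rule name here is hypothetical, and the grammar assumes values are either brace-delimited (with arbitrary nesting) or bare words; quoted strings, @string macros, comments and `#` concatenation are not handled.

```raku
grammar BibSketch {
    token TOP        { \s* [ <entry> \s* ]* }
    token entry      {
        '@' <entry-type> \s* '{' \s* <cite-key> \s*
        [ ',' \s* <field> \s* ]*
        ','? \s* '}'
    }
    token entry-type { \w+ }
    token cite-key   { <-[,{}\s]>+ }
    token field      { <field-name> \s* '=' \s* <value> }
    token field-name { \w+ }
    # A value is a (possibly nested) braced group, or a bare word such as a year.
    token value      { '{' [ <-[{}]>+ || <.value> ]* '}' || \w+ }
}

# Made-up sample entry.
my $bib = q:to/END/;
@article{knuth84,
  author = {Donald E. Knuth},
  title  = {Literate {P}rogramming},
  year   = 1984
}
END

my $m = BibSketch.parse($bib) or die 'parse failed';
for $m<entry>.list -> $e {
    # ~$e would give the complete entry text; here we just peek at the fields.
    say ~$e<cite-key>, ': ',
        $e<field>.map(-> $f { $f<field-name> ~ ' = ' ~ $f<value> }).join('; ');
}
# knuth84: author = {Donald E. Knuth}; title = {Literate {P}rogramming}; year = 1984
```

Each entry's match object stringifies to the full entry text, which covers the "grab the complete entry" requirement while still allowing a look at individual keys.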

answered Oct 16 '22 12:10

raiph
