I know Perl's "Marpa" Earley parser has very good error reporting.
But I can't find in its documentation or via Googling whether it has error recovery.
For instance, most C/C++ compilers have error recovery, which they use to report multiple syntax errors in one run, whereas other compilers stop at the first error.
I'm actually parsing natural language and wonder if there's a way to re-sync and resume parsing after one part of the input fails.
Example, for those who can grok it:
I'm parsing syllables in the Lao language. In Lao, some vowels are diacritics, which are encoded as separate characters and rendered above the preceding consonant. While parsing random articles from the Lao Wikipedia, I ran into some text where such a vowel was doubled. This is not allowed in Lao orthography, so it must be a typo. But I know that within a couple of characters the text is good again.
Anyway, this is the real example that piqued my general interest in error recovery and re-synchronizing with the token stream.
There are two possibilities for handling mistakes in Marpa.
Marpa maintains a lot of context during scanning. We can use this context to find out which tokens the parser expects at the current position, and then decide whether to offer such a token to Marpa even if it isn't in the input. Consider for example a programming language that requires every statement to be terminated by a semicolon. We can then use the Ruby Slippers technique to insert semicolons at specific locations, such as at the end of a line or before a closing brace:
use strict;
use warnings;
use Marpa::R2;
use Data::Dump 'dd';
my $grammar = Marpa::R2::Scanless::G->new({
    source => \q{
        :discard ~ ws

        Block ::= Statement+  action => ::array

        Statement ::= StatementBody (STATEMENT_TERMINATOR)  action => ::first
        StatementBody ::= 'statement'  action => ::first
                        | ('{') Block ('}')  action => ::first

        STATEMENT_TERMINATOR ~ ';'
        event ruby_slippers = predicted STATEMENT_TERMINATOR

        ws ~ [\s]+
    },
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
my $input = q(
    statement;
    { statement }
    statement
    statement
);
for (
    $recce->read(\$input);
    $recce->pos < length $input;
    $recce->resume
) {
    ruby_slippers($recce, \$input);
}
ruby_slippers($recce, \$input);  # fudge a final terminator at end of input
dd $recce->value;
sub ruby_slippers {
    my ($recce, $input) = @_;

    my %possible_tokens_by_length;

    my @expected = @{ $recce->terminals_expected };
    for my $token (@expected) {
        pos($$input) = $recce->pos;
        if ($token eq 'STATEMENT_TERMINATOR') {
            # fudge a terminator at the end of a line, or before a closing brace
            if ($$input =~ /\G \s*? (?: $ | [}] )/smxgc) {
                push @{ $possible_tokens_by_length{0} },
                     [STATEMENT_TERMINATOR => ';'];
            }
        }
    }

    my $max_length = 0;
    for (keys %possible_tokens_by_length) {
        $max_length = $_ if $_ > $max_length;
    }

    if (my $longest_tokens = $possible_tokens_by_length{$max_length}) {
        for my $lexeme (@$longest_tokens) {
            $recce->lexeme_alternative(@$lexeme);
        }
        $recce->lexeme_complete($recce->pos, $max_length);
        return ruby_slippers($recce, $input);
    }
}
In the ruby_slippers function, you could also count how often you needed to fudge a token. If that count exceeds some threshold, you could abandon the parse by throwing an error.
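One way to sketch that, reusing the ruby_slippers sub from above (the threshold of 10 and the variable name $fudge_count are arbitrary choices for illustration, not part of the original code):

```perl
my $fudge_count = 0;

sub ruby_slippers {
    ...
    if (my $longest_tokens = $possible_tokens_by_length{$max_length}) {
        # abandon the parse if we had to invent too many tokens
        die "Giving up: fudged too many tokens\n"
            if ++$fudge_count > 10;
        ...
    }
}
```

Catching that exception with eval or Try::Tiny then lets the caller decide how to report the parse failure.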
If your input may contain unparseable junk, you can try skipping it when no lexeme would otherwise be found. For this, the $recce->resume method takes an optional position argument at which normal parsing will resume.
use strict;
use warnings;
use Marpa::R2;
use Data::Dump 'dd';
use Try::Tiny;
my $grammar = Marpa::R2::Scanless::G->new({
    source => \q{
        :discard ~ ws

        Sentence ::= WORD+  action => ::array

        WORD ~ 'foo':i | 'bar':i | 'baz':i | 'qux':i
        ws ~ [\s]+
    },
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
my $input = '1) Foo bar: baz and qux, therefore qux (foo!) implies bar.';
try { $recce->read(\$input) };
while ($recce->pos < length $input) {
    # ruby_slippers($recce, \$input);
    try   { $recce->resume }                            # restart at current position
    catch { try { $recce->resume($recce->pos + 1) } };  # advance the position
    # if both fail, we go into a new iteration of the loop.
}
dd $recce->value;
While the same effect could be achieved with a :discard lexeme that matches anything, doing the skipping in our client code allows us to abort the parse if too much fudging had to be done.
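For completeness, that catch-all :discard variant might look something like this in the grammar (a sketch, untested; it relies on Marpa preferring the longest lexeme at each position, so that a real WORD beats a single junk character):

```
:discard ~ ws
:discard ~ junk

Sentence ::= WORD+  action => ::array

WORD ~ 'foo':i | 'bar':i | 'baz':i | 'qux':i
ws   ~ [\s]+
junk ~ [^\s]    # any single non-whitespace character is silently dropped
```

The trade-off is exactly as stated above: junk disappears silently, so there is no opportunity to count how much was skipped or to bail out on hopeless input.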