I have a simple grammar, and I am using it to parse some text. The text is user input, but my program guarantees that it starts with a match to the grammar. (I.e., if my grammar matched only `a`, the text might be `abc` or `a` or `a_`.) However, when I use the `.parse` method on my grammar, it fails on any non-exact match. How can I perform a partial match?
In Raku, `Grammar.parse` has to match the whole string. This is what causes it to fail if your grammar would only match `a` in the string `abc`. To allow matching only part of the input string, you can use `Grammar.subparse` instead.
grammar Foo {
token TOP { 'a' }
}
my $string = 'abc';
say Foo.parse($string); # Nil
say Foo.subparse($string); # 「a」
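If you also need to know where the partial match stopped, the `Match` object returned by `.subparse` can tell you. The following is a minimal sketch using the same `Foo` grammar; the variable names `$match` and `$string` are only illustrative.

```raku
grammar Foo {
    token TOP { 'a' }
}

my $string = 'abc';
my $match  = Foo.subparse($string);

# A failed match is falsy, so guard before using the result.
if $match {
    say $match.Str;                   # a
    say $match.to;                    # 1  (position just past the match)
    say $string.substr($match.to);    # bc (the unparsed remainder)
}
```

This can be handy when the rest of the user input has to be handed to some other piece of code.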
The input string needs to start with the potential `Match`. Otherwise, you will get a failed match.
say Foo.subparse('cbacb'); # #<failed match>
You can work around this using a Capture marker.
grammar Bar {
token TOP {
<-[a]>* # Match 0 or more characters that are *not* a
<( 'a' # Start the match, and match a single 'a'
}
}
say Bar.parse('a'); # 「a」
say Bar.subparse('a'); # 「a」
say Bar.parse('abc'); # Nil
say Bar.subparse('abc'); # 「a」
say Bar.parse('cbabc'); # Nil
say Bar.subparse('cbabc'); # 「a」
This works because `<-[a]>*`, a character class that includes any character except the letter `a`, will consume all the characters before a potential `a`. However, the Capture marker will cause these to be dropped from the eventual `Match` object, leaving you with just the `a` you wanted to match.
TL;DR
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
# Partial match anchored to end of string:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
There are traditionally two takes on the general notion of text "matching":
- "Parsing"
- "Regexes"
Raku:
- Provides a unified text pattern language and engine that do both jobs.
- Makes it easy to stick to one perspective or the other, or blend them, or refactor between them, as suits an individual dev and/or individual use case.
- Takes "parsing" to mean more or less a single match starting at the start of the input string, whereas "regexes" are much more flexible.
What you've written in your question and your first comment on Tyil's answer reflects the inherent ambiguity of the topic. I'll provide two answers rather than one, to try to help you and/or other readers be clearer about Raku's use of vocabulary and about your options in terms of functionality.
`.parse` et al
You began with:
Partial match in a grammar ... I have a simple grammar ... my program guarantees that it starts with a match to the grammar
With that in mind, here's your question:
How can I perform a partial match?
The phrases "guarantees that it starts" and "partial match" are ambiguous.
One take is that you want what I'll call a "prefix" match, matching one or more characters anchored from the start of the string, and not merely any sub-string starting and ending anywhere in the input string.
This nicely fits with "parsing", or at least Raku's use of the word in its grammar methods.
All the built-in `Grammar` methods with `parse` in their name insert an anchor to the start of the string in whatever grammar rule they use to start the parsing process. You cannot remove that anchor. This reflects the choice of vocabulary; "parse" is taken to mean matching from the start no matter what else happens.
The parse method for this "prefix" scenario is `.subparse`:
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
See also:
- Search of SO for "[raku] subparse".
- Raku doc for `.subparse`.
But perhaps "guarantees that it starts" and "partial match" did not mean that you wanted anchoring at the start. Your comment on Tyil's answer highlights this ambiguity:
Will `.subparse` only match at the start, or match anywhere in the string?
Tyil provides a workaround. You can do what Tyil shows, but it'll only match if the very first `a` encountered in the input string is the one that's at the start of the sub-string you want your "parse" to match.
If instead the first `a` was a false positive, and there was a second or a subsequent `a` you wanted the "parse" match to start at, then, at least in the Raku world, it's helpful to call that "regexing" rather than "parsing" and to use "regex" matching via the `~~` smartmatch operator.
~~
Raku lets you do unlimited partial matching if you use its `~~` construct with a regex.
For example, you could write:
# End of match at end of string (the `$` anchor):
say 'abcaa' ~~ token { a* $ } # 「aa」
`~~` with a regex tells Raku to:
1. Try to match starting at the first character position in the string on the LHS;
2. If that fails, step forward one character, and try again, with the new position in the input string treated as a fresh starting point;
3. Repeat that until either matching once, or failing to find any match in the entire string.
Here I've left the start position of the match unspecified (which `~~` takes to mean it can be anywhere in the string) and anchored the end of the pattern to the end of the input string. So it successfully matches the `aa` at the end of the string.
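To see where the scan actually landed, you can inspect the resulting `Match` object. This is a small sketch reusing the pattern from above; the variable names `$input` and `$match` are only illustrative.

```raku
my $input = 'abcaa';
my $match = $input ~~ token { a* $ };

say $match;            # 「aa」
say $match.from;       # 3   (start position found by stepping forward)
say $match.to;         # 5   (end of the input string, due to the $ anchor)
say $match.prematch;   # abc (the part of the input that was skipped over)
```

The non-zero `.from` shows that the match did not have to begin at the start of the string, which is the key difference from the `parse` methods.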
This anchoring freedom illustrates just one of the many ways that `~~` smart matching provides much greater matching flexibility than using the `parse` methods.
If you have an existing grammar you can still use that:
grammar foo { token TOP { a* } }
# Anchor matching to end of string (the `$` anchor):
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
You have to name both the grammar and the rule within it you wish to invoke and put them inside `<...>`. And you need to insert a `.` to avoid a correspondingly named sub-capture, presuming you don't want that.
Here's another example:
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
"Parsing" in Raku always starts at the beginning of an input string and results in either no match or one match.
In contrast, a "regex" can match arbitrary fragments, and can match any number of fragments. (You can even match overlapping fragments.)
In my last example I used `:g`, which is short for `:global`, which is a well known feature among traditional regex engines. `:g` matches as many times as a match is found in the input string (but not overlapping).
The match operation then returns either `Nil` (no matches at all) or a list of match objects (one or more). I've applied a `.max(*.chars)` to yield the longest match (the first if there are multiple longest sub-strings).