In ?agrep (grep with fuzzy matching) it mentions that I can set the argument fixed=FALSE to let my pattern be interpreted as a regular expression.
However, I can't get it to work!
agrep('(asdf|fdsa)', 'asdf', fixed=F)
# integer(0)
The above should match as the regular expression "(asdf|fdsa)" exactly matches the test string "asdf" in this case.
To confirm:
grep('(asdf|fdsa)', 'asdf', fixed=F)
# 1 : it does match with grep
And even more confusingly, adist correctly gives the distance between the pattern and string as 0, meaning that agrep should definitely return 1 rather than integer(0) (there's no possibility that 0 is greater than the default max.distance = 0.1).
adist('(asdf|fdsa)', 'asdf', fixed=F)
# [,1]
# [1,] 0
Why is this not working? Is there something I don't understand? A workaround?
I'm happy to use adist, but am not entirely sure how to convert agrep's default max.distance=0.1 parameter to adist's corresponding parameter.
(yes, I'm stuck on an old computer that can't do better than R 2.15.2)
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i686-redhat-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_AU.utf8 LC_NUMERIC=C
[3] LC_TIME=en_AU.utf8 LC_COLLATE=en_AU.utf8
[5] LC_MONETARY=en_AU.utf8 LC_MESSAGES=en_AU.utf8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
tl;dr: agrep(..., fixed=F) does not seem to work with the '|' character. Use aregexec.
Upon further investigation, I think this is a bug, and that agrep(..., fixed=F) does not seem to work with '|' regexes (although adist(..., fixed=F) does).
To elaborate, note that
adist('(asdf|fdsa)', 'asdf', fixed=T) # 7
nchar('(asdf|fdsa)') # 11
If 'asdf' were agrep'd to the non-regular-expression string '(asdf|fdsa)', then it would have distance 7.
On that note:
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=7) # 1
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=6) # integer(0)
These are the results I'd expect if fixed=T. If fixed=F, my regex would match 'asdf' exactly and the distance would be 0, so I'd always get a result of '1' back out of agrep.
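Since adist does return 0 with fixed=F here, one possible stopgap is to threshold its distances directly. The following is only a sketch: agrep_via_adist is a made-up helper name, and it assumes a fractional max.distance is scaled by the pattern length and rounded up, which is how ?agrep describes it for unit costs. For a regex pattern the "pattern length" is the length of the regex source string, so the threshold may be looser than intended.

```r
# Hypothetical helper (not part of base R): emulate agrep() by thresholding adist().
agrep_via_adist <- function(pattern, x, max.distance = 0.1) {
  # Assumption: a fractional max.distance is a fraction of the pattern length,
  # rounded up to the next integer; values >= 1 are used as-is.
  limit <- if (max.distance < 1) {
    ceiling(max.distance * nchar(pattern))
  } else {
    max.distance
  }
  # adist() returns a 1 x length(x) matrix here; flatten it to a plain vector.
  d <- as.vector(adist(pattern, x, fixed = FALSE))
  # Indices of the elements of x that are within the allowed distance.
  which(d <= limit)
}

agrep_via_adist('(asdf|fdsa)', 'asdf') # 1
```

For the example above the threshold works out to ceiling(0.1 * 11) = 2, and the distance reported by adist is 0, so the element is reported as a match.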
So it looks like agrep(pattern, x, fixed=F) does not work, i.e. it actually regards fixed as TRUE for this sort of pattern.
As @Arun mentions, it might just be '|' regexes that don't work. For example, agrep('la[sb]y', 'lazy', fixed=FALSE) does work as expected.
The function aregexec appears to work.
> aregexec('(asdf|fdsa)', 'asdf', fixed=F)
[[1]]
[1] 1 1
attr(,"match.length")
[1] 4 4
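To get agrep-style indices out of aregexec, something along these lines might do. It is a minimal sketch rather than a drop-in replacement: agrep_via_aregexec is a hypothetical name, and it relies on the documented behaviour that aregexec returns -1 for elements without a match.

```r
# Hypothetical helper (not part of base R): agrep()-style indices via aregexec().
agrep_via_aregexec <- function(pattern, x, max.distance = 0.1) {
  # One list element per element of x.
  m <- aregexec(pattern, x, max.distance = max.distance, fixed = FALSE)
  # A non-match is reported as -1; anything else starts with a match position.
  hit <- vapply(m, function(pos) pos[1] != -1, logical(1))
  which(hit)
}

agrep_via_aregexec('(asdf|fdsa)', 'asdf') # 1
```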
This has (finally) been fixed in the R sources ("trunk" / R-devel) and in R-patched, which will become R 3.5.1 in early July 2018.