
Difficulties with `agrep(..., fixed=F)`

Tags: r

In ?agrep (grep with fuzzy matching) it mentions that I can set the argument fixed=FALSE to let my pattern be interpreted as a regular expression.

However, I can't get it to work!

agrep('(asdf|fdsa)', 'asdf', fixed=F)
# integer(0)

The above should match as the regular expression "(asdf|fdsa)" exactly matches the test string "asdf" in this case.

To confirm:

grep('(asdf|fdsa)', 'asdf', fixed=F)
# 1 : it does match with grep

And even more confusingly, adist correctly gives the distance between the pattern and string as 0, meaning that agrep should definitely return 1 rather than integer(0) (there's no possibility that 0 is greater than the default max.distance = 0.1).

adist('(asdf|fdsa)', 'asdf', fixed=F)
#      [,1]
# [1,]    0

Why is this not working? Is there something I don't understand? A workaround? I'm happy to use adist, but am not entirely sure how to convert agrep's default max.distance=0.1 parameter to adist's corresponding parameter.

(yes, I'm stuck on an old computer that can't do better than R 2.15.2)

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i686-redhat-linux-gnu (32-bit)    
locale:
 [1] LC_CTYPE=en_AU.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_AU.utf8        LC_COLLATE=en_AU.utf8    
 [5] LC_MONETARY=en_AU.utf8    LC_MESSAGES=en_AU.utf8   
 [7] LC_PAPER=C                LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_AU.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base 

mathematical.coffee asked Apr 08 '13 05:04


2 Answers

tl;dr: agrep(..., fixed=F) does not seem to work with the '|' character. Use aregexec.

Upon further investigation, I think this is a bug, and that agrep(..., fixed=F) does not seem to work with '|' regexes (although adist(..., fixed=F) does).

To elaborate, note that

adist('(asdf|fdsa)', 'asdf', fixed=T) # 7
nchar('(asdf|fdsa)')                  # 11

If 'asdf' were agrep'd to the non-regular-expression string '(asdf|fdsa)', then it would have distance 7.

On that note:

agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=7) # 1
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=6) # integer(0)

These are the results I'd expect if fixed=T. If fixed=F, my regex would match 'asdf' exactly and the distance would be 0, so I'd always get a result of '1' back out of agrep.

So it looks like agrep(pattern, x, fixed=F) does not work here, i.e. it actually treats fixed as TRUE for this sort of pattern.

As @Arun mentions, it might just be '|' regexes that don't work. For example, agrep('la[sb]y', 'lazy', fixed=FALSE) does work as expected.
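
A quick way to check whether a given R build shows this behaviour is to wrap the failing call from the question in a small test. This is only a sketch: the helper name agrep_handles_alternation is made up for illustration and is not part of base R, and it probes only the single pattern used above.

```r
# Hypothetical helper (not part of base R): returns TRUE if agrep() matches
# the alternation pattern from the question when fixed = FALSE, FALSE if it
# returns integer(0) as described above.
agrep_handles_alternation <- function() {
  hits <- agrep("(asdf|fdsa)", "asdf", fixed = FALSE)
  identical(hits, 1L)
}

agrep_handles_alternation()
```

A FALSE result suggests the build is affected and that a workaround is needed for '|' patterns.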


EDIT: Workaround (thanks @Arun)

The function aregexec appears to work.

> aregexec('(asdf|fdsa)', 'asdf', fixed=F)
[[1]]
[1] 1 1
attr(,"match.length")
[1] 4 4
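
To get agrep-style indices back out of aregexec, something along these lines might do. It is a sketch under stated assumptions, not a drop-in replacement: the helper names agrep_via_aregexec and agrep_via_adist are invented here, and the adist variant assumes that a fractional max.distance is scaled by the pattern length and rounded up (as ?agrep describes) with all costs left at their default of 1.

```r
# Hypothetical helper (not part of base R): indices of the elements of x
# that aregexec() reports as matching the regular expression `pattern`.
agrep_via_aregexec <- function(pattern, x, max.distance = 0.1, ...) {
  m <- aregexec(pattern, x, max.distance = max.distance, fixed = FALSE, ...)
  # aregexec() returns -1 for an element without a match
  which(vapply(m, function(hit) hit[1L] != -1L, logical(1)))
}

# Hypothetical adist-based variant.
# Assumption: a max.distance below 1 is a fraction of the pattern length,
# rounded up to the next integer; values >= 1 are used as given.
agrep_via_adist <- function(pattern, x, max.distance = 0.1) {
  threshold <- if (max.distance < 1) {
    ceiling(max.distance * nchar(pattern))
  } else {
    max.distance
  }
  d <- adist(pattern, x, fixed = FALSE)  # 1 x length(x) distance matrix
  which(as.vector(d) <= threshold)
}

x <- c("asdf", "fdsa", "zzzzzzzz")
agrep_via_aregexec("(asdf|fdsa)", x)
agrep_via_adist("(asdf|fdsa)", x)
```

The aregexec version leaves the interpretation of max.distance to R itself, so it is probably the safer of the two. The adist version recomputes the threshold by hand and may not agree with agrep for custom costs or list-valued max.distance.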

mathematical.coffee answered Oct 22 '22 13:10


This has (finally) been fixed in the R sources ("trunk" / R-devel) and in R-patched, which will become R 3.5.1 in early July 2018.

Martin Mächler answered Oct 22 '22 11:10