I have a vector of sentences such as:
example <- c("text text word1 text text word2 text text", ...)
and I'm trying to identify which sentences comply with the following rules:
This could be done with a normal regex. However, the problem is that "word1" or "word2" can contain typos (I am expecting at most a distance of 3 for both words together). Examples of typos could be "wrod1", "woord2", "wrd1", etc. I also want to match the sentences that contain typos for these words within the distance constraint. Therefore I was trying to use agrepl:
agrepl("(?:.*?)\\bword1\\b(?:\\s(?:\\w+\\s){0,3})\\bword2\\b(?:.*?)", example, fixed=FALSE, max=3)
However, I believe that the distance is being calculated with the whole sentence and not only with "word1" and "word2", and therefore I will almost never get any matches in this way. Any suggestions on how to fix this, or is agrepl/regex not the best tool for this problem?
This fits your rules; however, I'm not sure what your typos would look like. If you could show some examples, that would be great.
^(?=.*word1\s+(?:\S+\s+){0,3}word2.*$).*
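If the typos matter, one possible way around the whole-sentence distance problem is to skip the single fuzzy regex and compare individual tokens instead. The following is only a rough sketch under some assumptions: sentences are split on whitespace, the typo budget is the sum of the two Levenshtein distances returned by `adist()`, and `fuzzy_pair_match`, `max_gap` and `max_dist` are names I made up for illustration.

```r
# Hypothetical helper (not part of base R): TRUE when the sentence contains a
# token close to `w1` followed, with at most `max_gap` tokens in between, by a
# token close to `w2`, where the two edit distances together are <= `max_dist`.
fuzzy_pair_match <- function(sentence, w1 = "word1", w2 = "word2",
                             max_gap = 3, max_dist = 3) {
  # Assumption: words are separated by whitespace only
  tokens <- strsplit(trimws(sentence), "\\s+")[[1]]
  n <- length(tokens)
  if (n < 2) return(FALSE)

  # Levenshtein distance of every token to each target word
  d1 <- as.vector(adist(tokens, w1))
  d2 <- as.vector(adist(tokens, w2))

  for (i in seq_len(n - 1)) {
    # Skip tokens that already use up more than the whole typo budget
    if (d1[i] > max_dist) next
    # Candidate positions for w2: the next max_gap + 1 tokens
    last <- min(n, i + max_gap + 1)
    j <- (i + 1):last
    if (any(d1[i] + d2[j] <= max_dist)) return(TRUE)
  }
  FALSE
}

# Apply to a character vector of sentences
sentences <- c("text text word1 text text word2 text text",
               "text wrod1 text woord2 text",
               "word1 a b c d word2")
vapply(sentences, fuzzy_pair_match, logical(1), USE.NAMES = FALSE)
```

Because the placeholders "word1" and "word2" are themselves only one edit apart, a budget of 3 will also accept things like "word1 word1"; with real, more distinct words this should be less of an issue. Note that `adist()` counts a swapped pair of letters such as "wrod1" as two edits, not one.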