R remove repeated digit sequences

Question

I am trying to remove every run of digits in a string except the first one. The string may contain anywhere from one to ten or more runs of digits, but I only want to keep the first run along with the rest of the (non-digit) text.

For example, the following string:

x <- 'foo123bar123baz123456abc1111def123456789'

The result would be:

foo123barbazabcdef

I have tried using gsub to replace \d+ with an empty string, but this removes all digits in the string. I have also tried using groups to capture some of the results, but had no luck.

Avinash Raj · Accepted Answer

You could do this with the PCRE verbs (*SKIP)(*F).

^\D*\d+(*SKIP)(*F)|\d+

^\D*\d+ matches everything from the start of the string up to and including the first number. (*SKIP)(*F) then forces that match to fail and prevents the engine from retrying inside the text it has just consumed, so the engine goes on to match the pattern on the right-hand side of |, that is \d+, against the remainder of the string. Because (*SKIP)(*F) are PCRE verbs, you must pass perl=TRUE.

Code:

> x <- 'foo123bar123baz123456abc1111def123456789'
> gsub("^\\D*\\d+(*SKIP)(*F)|\\d+", "", x, perl=TRUE)
[1] "foo123barbazabcdef"
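
To check how the pattern behaves beyond the sample string, here is a minimal sketch that wraps the call in a helper and tries a few edge cases. The helper name keep_first_digits and the extra inputs are assumptions made for illustration; the pattern itself is the one from the answer.

```r
# Hypothetical helper name; the regex is the (*SKIP)(*F) pattern shown above.
keep_first_digits <- function(s) {
  # perl = TRUE is required because (*SKIP)(*F) are PCRE-only verbs.
  gsub("^\\D*\\d+(*SKIP)(*F)|\\d+", "", s, perl = TRUE)
}

# Sample string from the question.
keep_first_digits("foo123bar123baz123456abc1111def123456789")
# [1] "foo123barbazabcdef"

# A string without digits is returned unchanged.
keep_first_digits("no digits here")
# [1] "no digits here"

# gsub is vectorised, so a character vector works as well.
keep_first_digits(c("a1b22c333", "x9"))
# [1] "a1bc" "x9"
```

Since gsub is vectorised over its input, the same helper can be applied directly to a data frame column.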

hwnd · Answer

With gsub you can also use \G, an anchor that matches at one of two positions: the start of the string, or the position where the previous match ended.

x <- 'foo123bar123baz123456abc1111def123456789'
gsub('(?:\\d+|\\G(?<!^)\\D*)\\K\\d*', '', x, perl=TRUE)
# [1] "foo123barbazabcdef"

Explanation:

(?:           # group, but do not capture:
  \d+         #   digits (0-9) (1 or more times)
 |            # OR
  \G(?<!^)    #   contiguous to the preceding match, but not at the start of the string
  \D*         #   non-digits (all but 0-9) (0 or more times)
)\K           # end of grouping; \K drops everything matched so far from the reported match
\d*           # digits (0-9) (0 or more times)

Alternatively, you can use an optional group:

gsub('(?:^\\D*\\d+)?\\K\\d*', '', x, perl=TRUE)

Another approach that I find useful, and that requires neither the (*SKIP)(*F) backtracking verbs nor the \G and \K features, is to use the alternation operator: place what you want to keep in a capturing group on the left side and what you want to exclude on the right side (in effect saying "throw this away, it's garbage"), then replace each match with the captured group.

gsub('^(\\D*\\d+)|\\d+', '\\1', x)
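
The capture-group idea can be sketched as a small function. The name strip_later_digits and the extra test inputs are hypothetical. The sketch also passes perl = TRUE, which the call above does not use; that is an assumption made here so that the alternation is tried strictly left to right.

```r
# Hypothetical helper name; the pattern is the alternation shown above.
strip_later_digits <- function(s) {
  # Left branch: leading non-digits plus the first digit run, captured as \\1
  # and written back unchanged.
  # Right branch: any later digit run. Group 1 is unset there, so \\1 expands
  # to nothing and the run is dropped.
  # perl = TRUE is an assumption of this sketch, not part of the original call.
  gsub("^(\\D*\\d+)|\\d+", "\\1", s, perl = TRUE)
}

inputs <- c("foo123bar123baz123456abc1111def123456789", "a1b22c333", "abc")
strip_later_digits(inputs)
# [1] "foo123barbazabcdef" "a1bc"               "abc"
```

The left branch can only match once per string because ^ anchors it to the start, so every later digit run falls through to the right branch.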

Tags:

regex

r

user3856888
