R remove repeated digit sequences

Tags:

regex

r

I am trying to remove all digits in a string except the first set of digits. In other words, the string may contain anywhere from one to ten or more sets of digits, but I only want to keep the first set along with the rest of the string.

For example, the following string:

x <- 'foo123bar123baz123456abc1111def123456789'

The result would be:

foo123barbazabcdef

I have tried using gsub to replace \d+ with an empty string, but this removes all digits in the string. I have also tried using groups to capture some of the results, but had no luck.

user3856888 asked Nov 30 '14 06:11


2 Answers

You can do this with the PCRE verbs (*SKIP)(*F).

^\D*\d+(*SKIP)(*F)|\d+

^\D*\d+ matches everything from the start of the string up to and including the first number. (*SKIP)(*F) then forces that match to fail and stops the engine from retrying inside the skipped text, so the engine falls back to the pattern on the right side of |, namely \d+, and matches it against the remaining string. Because (*SKIP)(*F) are PCRE verbs, you must set perl=TRUE.

Code:

> x <- 'foo123bar123baz123456abc1111def123456789'
> gsub("^\\D*\\d+(*SKIP)(*F)|\\d+", "", x, perl=TRUE)
[1] "foo123barbazabcdef"
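
As a quick sanity check, the same pattern can be wrapped in a small helper and run against a few edge cases. This is only a sketch: the function name keep_first_digits and the test inputs are made up for illustration and are not part of the question.

```r
# Hypothetical wrapper around the (*SKIP)(*F) pattern from this answer.
keep_first_digits <- function(s) {
  # Left branch: consume everything up to and including the first digit run,
  # then force a failure so that text is skipped rather than replaced.
  # Right branch: match (and delete) every later digit run.
  gsub("^\\D*\\d+(*SKIP)(*F)|\\d+", "", s, perl = TRUE)
}

# gsub is vectorised, so several strings can be checked at once.
inputs <- c("abc", "abc42", "42abc7", "a1b2c3")
result <- keep_first_digits(inputs)
result
# "abc" "abc42" "42abc" "a1bc"
```

Strings with no digits, or with a single set of digits, come back unchanged; only the second and later sets are removed.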
Avinash Raj answered Nov 10 '22 12:11


With gsub you can use \G, an anchor that matches at one of two positions: the start of the string, or the position where the previous match ended.

x <- 'foo123bar123baz123456abc1111def123456789'
gsub('(?:\\d+|\\G(?<!^)\\D*)\\K\\d*', '', x, perl=TRUE)
# [1] "foo123barbazabcdef"

Explanation:

(?:           # group, but do not capture:
  \d+         #   digits (0-9) (1 or more times)
 |            # OR
  \G(?<!^)    #   contiguous to the previous match, but not at the start of the string
  \D*         #   non-digits (all but 0-9) (0 or more times)
)\K           # end of grouping; \K drops everything matched so far from the reported match
\d*           # digits (0-9) (0 or more times)

Alternatively, you can use an optional group:

gsub('(?:^\\D*\\d+)?\\K\\d*', '', x, perl=TRUE)

Another way that I find useful, and which requires neither the (*SKIP)(*F) backtracking verbs nor the \G and \K features, is to use the alternation operator in context: place what you want to keep in a capturing group on the left side, and place what you want to exclude on the right side (in effect saying, throw this away, it's garbage...).

gsub('^(\\D*\\d+)|\\d+', '\\1', x)
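
To see why this works, it can help to make the backreference visible. The sketch below is an illustration only (the bracketed replacement and the variable name traced are not part of the solution): it wraps \\1 in square brackets, so each match shows what the capturing group held at that point.

```r
x <- 'foo123bar123baz123456abc1111def123456789'

# Same pattern as above, but the replacement is '[\\1]' instead of '\\1'.
# When the left branch matches, group 1 holds the prefix and first digit run.
# When the right branch matches, group 1 did not participate, so it is empty.
traced <- gsub('^(\\D*\\d+)|\\d+', '[\\1]', x)
traced
# [1] "[foo123]bar[]baz[]abc[]def[]"
```

The first match is put back unchanged through \\1, while every later digit run is replaced by an empty group, which is what removes it.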
hwnd answered Nov 10 '22 11:11