
R Regexp - extract number with 5 digits

Tags:

regex

r

I have a string a like this one:

stundenwerte_FF_00691_19260101_20131231_hist.zip

and would like to extract the 5-digit number "00691" from it.

I tried using gregexpr and regmatches as well as stringr::str_extract but couldn't figure out the right regexp. I got as far as:

gregexpr("[:digits{5}:]",a)

I expected this to return the 5-digit numbers, but it does not work properly, and I don't understand how to fix it:

m <- gregexpr("[:digits{5}:]",a)
regmatches(a,m)

Thanks for your help in advance!

Rentrop asked Oct 25 '14 02:10


4 Answers

You could simply use sub to grab the digits. IMO, regmatches is not necessary for this simple case.

x <- 'stundenwerte_FF_00691_19260101_20131231_hist.zip'
sub('\\D*(\\d{5}).*', '\\1', x)
# [1] "00691"

Edit: If other strings contain digits before the 5-digit field, modify the expression slightly so that it anchors on the surrounding underscores:

sub('.*_(\\d{5})_.*', '\\1', x)
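A small sketch of why the underscore-anchored version is safer. The input x2 is a made-up filename (an assumption, not from the question) that starts with digits. With the first pattern the match cannot begin at the start of the string, so the leading text stays in the result:

```r
# x2 is a hypothetical filename with digits before the 5-digit field
x2 <- "run2014_FF_00691_19260101_20131231_hist.zip"

# First pattern: the leading "run2014" is not part of the match,
# so it is left in place and the result is not the bare field
first <- sub("\\D*(\\d{5}).*", "\\1", x2)
print(first == "00691")
# [1] FALSE

# Underscore-anchored pattern: only a field of exactly 5 digits
# between two underscores can match
second <- sub(".*_(\\d{5})_.*", "\\1", x2)
print(second)
# [1] "00691"
```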
hwnd answered Oct 23 '22 19:10


1) sub

sub(".*_(\\d{5})_.*", "\\1", x)
## [1] "00691"

2) gsubfn::strapplyc The regexp can be slightly simplified if we use strapplyc:

library(gsubfn)

strapplyc(x, "_(\\d{5})_", simplify = TRUE)
## [1] "00691"

3) Splitting on "_" If we know that it is the third field, read.table can split the string for us:

read.table(text = x, sep = "_", colClasses = "character")$V3
## [1] "00691"

3a) or, with strsplit:

strsplit(x, "_")[[1]][3]
## [1] "00691"
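The strsplit variant above indexes a single string. For a whole vector of file names, one possible sketch is the following; the second file name is invented for illustration, and the approach still assumes the number is always the third field:

```r
# The second file name is hypothetical, made up for this example
files <- c("stundenwerte_FF_00691_19260101_20131231_hist.zip",
           "stundenwerte_FF_00704_19500101_20131231_hist.zip")

# Split every name on "_" and take the third piece of each
parts <- strsplit(files, "_", fixed = TRUE)
ids <- vapply(parts, `[`, character(1), 3)
print(ids)
## [1] "00691" "00704"
```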
G. Grothendieck answered Oct 23 '22 18:10


You could try the regex below, which uses negative lookaround assertions. We can't use word boundaries here, as in \\b\\d{5}\\b, because the preceding and following character _ is itself a word character (\w), so there is no word boundary between _ and a digit.

> x <- "stundenwerte_FF_00691_19260101_20131231_hist.zip"
> m <- regexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[1] "00691"
> m <- gregexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl=TRUE)
> regmatches(x, m)[[1]]
[1] "00691"

Explanation:

  • (?<!\\d) Negative lookbehind asserts that the match is not preceded by a digit.
  • \\d{5} Match exactly 5 digits.
  • (?!\\d) Negative lookahead asserts that the match is not followed by a digit.
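The point about word boundaries can be checked directly. A minimal sketch, using the same x as above, comparing the \\b version with the lookaround version:

```r
x <- "stundenwerte_FF_00691_19260101_20131231_hist.zip"

# Word-boundary version: "_" counts as a word character, so there is
# no \b between "_" and a digit, and nothing matches
wb <- regmatches(x, regexpr("\\b\\d{5}\\b", x, perl = TRUE))
print(length(wb))
# [1] 0

# Lookaround version: only requires that the neighbours are not digits
la <- regmatches(x, regexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl = TRUE))
print(la)
# [1] "00691"
```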
Avinash Raj answered Oct 23 '22 18:10


Let the string be:

ss ="stundenwerte_FF_00691_19260101_20131231_hist.zip"

You can split the string and unlist the substrings:

ll = unlist(strsplit(ss,'_'))

Then build a logical index that is TRUE for the substrings that are 5 characters long:

idx = sapply(ll, nchar)==5

And select those substrings:

ll[idx]
[1] "00691"
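Note that the length test keeps any 5-character field, not only numbers. If that matters, one possible refinement is to test for exactly five digits instead. The string ss2 below is a made-up example (an assumption) containing a 5-letter field:

```r
# ss2 is hypothetical: it has a 5-letter field ("hists") next to the number
ss2 <- "stundenwerte_FF_00691_hists_19260101.zip"
ll2 <- unlist(strsplit(ss2, "_"))

# Length check alone keeps both 5-character fields
by_length <- ll2[nchar(ll2) == 5]
print(by_length)
# [1] "00691" "hists"

# Requiring exactly five digits keeps only the number
by_digits <- ll2[grepl("^[0-9]{5}$", ll2)]
print(by_digits)
# [1] "00691"
```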
rnso answered Oct 23 '22 17:10