I'm trying to extract part of a filename using R. I have a vague idea of how to go about this from the question "extract part of a file name in R", but I can't quite get it to work on my list of filenames.
Example filenames:
"Species Count (2011-12-15-07-09-39).xls"
"Species Count 0511.xls"
"Species Count 151112.xls"
"Species Count1011.xls"
"Species Count2012-01.xls"
"Species Count201207.xls"
"Species Count2013-01-15.xls"
Some of the filenames have a space between "Species Count" and the date and some do not; they are of different lengths, and some contain brackets. I just want to extract the numerical part of each filename, keeping the hyphens as well. So for the data above I would have:
Expected output:
2011-12-15-07-09-39 , 0511 , 151112 , 1011 , 2012-01 , 201207 , 2013-01-15
Here's one way:
regmatches(tt, regexpr("[0-9].*[0-9]", tt))
I assume that there are no other numbers in your file names. So, we just search for the first digit and use the greedy operator .* so that everything up to the last digit is captured. This is done using regexpr, which'll get the position of the matches. Then we use regmatches to extract the (sub)string out of these matched positions.
where tt is:
tt <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
"Species Count 151112.xls", "Species Count1011.xls",
"Species Count2012-01.xls", "Species Count201207.xls",
"Species Count2013-01-15.xls")
Quite some nice answers there. So, it's time for benchmarking :)
tt <- rep(tt, 1e5) # tt is from above
require(microbenchmark)
require(stringr)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt))
bb <- function() gsub("[A-z \\.\\(\\)]", "", tt)
cc <- function() str_extract(tt,'([0-9]|[0-9][-])+')
microbenchmark(arun <- aa(), agstudy <- cc(), Jean <- bb(), times=25)
Unit: seconds
            expr      min       lq   median       uq       max neval
    arun <- aa() 1.951362 2.064055 2.198644 2.397724  3.236296    25
 agstudy <- cc() 2.489993 2.685285 2.991796 3.198133  3.762166    25
    Jean <- bb() 7.824638 8.026595 9.145490 9.788539 10.926665    25
identical(arun, agstudy) # TRUE
identical(arun, Jean) # TRUE
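Going back to the regexpr/regmatches approach at the top of this answer, here is a small sketch of what each step returns for a single filename. The names x, m, start, len, res, y and res_y are just illustrative, and the second filename is a made-up one that breaks the "no other numbers" assumption.

```r
# One filename from the question, to see what each step returns.
x <- "Species Count (2011-12-15-07-09-39).xls"

# regexpr() gives the start of the first match; the match length is an attribute.
m <- regexpr("[0-9].*[0-9]", x)
start <- as.integer(m)            # 16, the position of the first digit
len <- attr(m, "match.length")    # 19, first digit through last digit

# regmatches() cuts the matched substring out using those two numbers.
res <- regmatches(x, m)           # "2011-12-15-07-09-39"

# Hypothetical filename with a stray digit after the date: the greedy .*
# runs on to the last digit, so the extra text ends up in the match.
y <- "Species Count2012-01 v2.xls"
res_y <- regmatches(y, regexpr("[0-9].*[0-9]", y))   # "2012-01 v2"
```

So the pattern is fine for the filenames in the question, but it would need tightening if other digits can appear after the date.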
Use the function gsub() to remove all of the letters, spaces, periods, and parentheses. Then you will be left with numbers and hyphens. For example,
x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
"Species Count 151112.xls", "Species Count1011.xls", "Species Count2012-01.xls",
"Species Count201207.xls", "Species Count2013-01-15.xls")
gsub("[A-Za-z \\.\\(\\)]", "", x)
[1] "2011-12-15-07-09-39" "0511" "151112"
[4] "1011" "2012-01" "201207"
[7] "2013-01-15"
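A variation on the same idea, offered only as a sketch: instead of listing the characters to delete, delete everything that is not a digit or a hyphen. This is a different pattern from the one above, and the name res is just illustrative. It assumes hyphens only ever appear inside the date part, since a hyphen anywhere else in the name would be kept too.

```r
x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
       "Species Count 151112.xls", "Species Count1011.xls",
       "Species Count2012-01.xls", "Species Count201207.xls",
       "Species Count2013-01-15.xls")

# [^0-9-] matches any character that is neither a digit nor a hyphen
# (the trailing - inside the brackets is a literal hyphen).
res <- gsub("[^0-9-]", "", x)
res
```

This gives the same seven strings as the expected output in the question, without having to anticipate which other characters might show up in a filename.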
If you're concerned about speed, you can use sub with back-references to extract the portions you want. Also note that perl=TRUE is often faster (according to ?grep).
jj <- function() sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl=TRUE)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl=TRUE))
# Run on R-2.15.2 on 32-bit Windows
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: milliseconds
#           expr       min        lq    median        uq       max
# 1 arun <- aa() 2156.5024 2189.5168 2191.9972 2195.4176 2410.3255
# 2 josh <- jj()  390.0142  390.8956  391.6431  394.5439  493.2545
identical(arun, josh) # TRUE
# Run on R-3.0.1 on 64-bit Ubuntu
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  arun <- aa() 1.794522 1.839044 1.858556 1.894946 2.207016    25
#  josh <- jj() 1.003365 1.008424 1.009742 1.059129 1.074057    25
identical(arun, josh) # still TRUE
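One difference between the two approaches that the benchmark doesn't show is what happens when a filename has no match. Here's a small sketch; the names pat, files, via_sub and via_regmatches are illustrative, and "README.txt" is a made-up filename with no digits in it.

```r
# [^0-9]* eats the leading non-digits, the group captures first digit
# through last digit, and the trailing [^0-9]* eats the rest, so the
# whole string is replaced by the captured group (\\1).
pat <- "[^0-9]*([0-9].*[0-9])[^0-9]*"
files <- c("Species Count2012-01.xls", "README.txt")  # second name is hypothetical

# sub() returns one element per input; an input that doesn't match
# comes back unchanged.
via_sub <- sub(pat, "\\1", files, perl = TRUE)

# regmatches() with regexpr() silently drops inputs that don't match,
# so the result can be shorter than the input.
via_regmatches <- regmatches(files, regexpr("[0-9].*[0-9]", files, perl = TRUE))

length(via_sub)         # 2
length(via_regmatches)  # 1
```

So identical() will only agree when every filename matches, as is the case for the names in the question. If some names might not contain a date, it's probably worth checking for that before relying on either result.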