R: extract part of a filename

Tags: r, filenames
I'm trying to extract part of a filename using R. I have a vague idea of how to go about this from the question "extract part of a file name in R", but I can't quite get it to work on my list of filenames.

example of filenames:

"Species Count (2011-12-15-07-09-39).xls"
"Species Count 0511.xls"
"Species Count 151112.xls" 
"Species Count1011.xls" 
"Species Count2012-01.xls" 
"Species Count201207.xls" 
"Species Count2013-01-15.xls"  

Some of the filenames have a space between "Species Count" and the date and some do not. They are of different lengths, and some contain brackets. I just want to extract the numerical part of each filename and keep the hyphens as well. So for the data above I would have:

Expected output:

2011-12-15-07-09-39 , 0511 , 151112 , 1011 , 2012-01 , 201207 , 2013-01-15
userk asked Aug 06 '13 13:08


3 Answers

Here's one way:

regmatches(tt, regexpr("[0-9].*[0-9]", tt))

I assume that there are no other numbers in your file names. The pattern starts matching at the first digit, and the greedy .* then extends the match up to the last digit in the string. regexpr returns the position and length of that match in each string, and regmatches uses those positions to extract the matching (sub)strings.


where tt is:

tt <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
        "Species Count 151112.xls", "Species Count1011.xls", 
        "Species Count2012-01.xls", "Species Count201207.xls", 
        "Species Count2013-01-15.xls")
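
To make the two steps concrete, here is a minimal sketch that inspects what regexpr returns before regmatches is applied. The vector `files` and the digit-free name "README.txt" are assumptions added for illustration; they are not part of the original data.

```r
# Two file names from the question plus a hypothetical one without digits.
files <- c("Species Count (2011-12-15-07-09-39).xls",
           "Species Count 0511.xls",
           "README.txt")

m <- regexpr("[0-9].*[0-9]", files)

starts    <- as.integer(m)            # start of each match, -1 where nothing matched
match_len <- attr(m, "match.length")  # length of each match, -1 where nothing matched

# regmatches() returns only the matched substrings,
# so the result can be shorter than the input vector.
dates <- regmatches(files, m)
print(dates)
```

Because unmatched elements are dropped, the result is only guaranteed to line up with the input when every file name contains a match. Note also that the pattern needs at least two digits to match at all.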

Benchmarking:

Note: Benchmarking results may differ between Windows and *nix machines (as @Hansi notes in the comments).

There are some nice answers here, so it's time for benchmarking :)

tt <- rep(tt, 1e5) # tt is from above

require(microbenchmark)
require(stringr)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt))
bb <- function() gsub("[A-z \\.\\(\\)]", "", tt)
cc <- function() str_extract(tt,'([0-9]|[0-9][-])+')

microbenchmark(arun <- aa(), agstudy <- cc(), Jean <- bb(), times=25)
Unit: seconds
            expr      min       lq   median       uq       max neval
    arun <- aa() 1.951362 2.064055 2.198644 2.397724  3.236296    25
 agstudy <- cc() 2.489993 2.685285 2.991796 3.198133  3.762166    25
    Jean <- bb() 7.824638 8.026595 9.145490 9.788539 10.926665    25

identical(arun, agstudy) # TRUE
identical(arun, Jean) # TRUE
Arun answered Sep 28 '22 01:09


Use the function gsub() to remove all of the letters, spaces, periods, and parentheses. Then you will be left with numbers and hyphens. For example,

x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
    "Species Count 151112.xls", "Species Count1011.xls", "Species Count2012-01.xls", 
    "Species Count201207.xls", "Species Count2013-01-15.xls")

gsub("[A-z \\.\\(\\)]", "", x)

[1] "2011-12-15-07-09-39" "0511"                "151112"             
[4] "1011"                "2012-01"             "201207"             
[7] "2013-01-15"         
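
A possible variant of the same idea, offered as a sketch and not as part of the original answer: list the characters to keep and remove everything else. This sidesteps the `A-z` range, which in ASCII order also covers the punctuation characters between `Z` and `a`, and it handles characters that a removal list might not anticipate. The name `cleaned` is an assumption for illustration.

```r
x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
       "Species Count 151112.xls", "Species Count1011.xls",
       "Species Count2012-01.xls", "Species Count201207.xls",
       "Species Count2013-01-15.xls")

# Negated character class: delete every character that is not a digit or a hyphen.
# The hyphen is placed last inside the brackets so it is read literally.
cleaned <- gsub("[^0-9-]", "", x)
print(cleaned)
```

Like the original gsub() call, this keeps every digit and hyphen wherever it appears, so it relies on the file names having no other digits or hyphens.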
Jean V. Adams answered Sep 28 '22 01:09


If you're concerned about speed, you can use sub with back-references to extract the portions you want. Also note that perl=TRUE is often faster (according to ?grep).

jj <- function() sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl=TRUE)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl=TRUE))

# Run on R-2.15.2 on 32-bit Windows
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: milliseconds
#           expr       min        lq    median        uq       max
# 1 arun <- aa() 2156.5024 2189.5168 2191.9972 2195.4176 2410.3255
# 2 josh <- jj()  390.0142  390.8956  391.6431  394.5439  493.2545
identical(arun, josh)  # TRUE

# Run on R-3.0.1 on 64-bit Ubuntu
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  arun <- aa() 1.794522 1.839044 1.858556 1.894946 2.207016    25
#  josh <- jj() 1.003365 1.008424 1.009742 1.059129 1.074057    25
identical(arun, josh)  # still TRUE
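
For readers unfamiliar with back-references, here is a small sketch of what the sub() call does on individual strings. The vector `files` and the digit-free name "README.txt" are assumptions added for illustration.

```r
# Two file names from the question plus a hypothetical one without digits.
files <- c("Species Count2012-01.xls", "Species Count 151112.xls", "README.txt")

# [^0-9]*          consumes everything before the first digit
# ([0-9].*[0-9])   captures from the first digit to the last digit (group 1)
# [^0-9]*          consumes everything after the last digit
pattern <- "[^0-9]*([0-9].*[0-9])[^0-9]*"

# "\\1" replaces the whole matched string with the captured group.
extracted <- sub(pattern, "\\1", files, perl = TRUE)
print(extracted)
```

One difference worth keeping in mind: sub() returns a string unchanged when the pattern does not match, whereas regmatches() drops it. If some file names may lack a date, `grepl("[0-9].*[0-9]", files)` can be used to identify them first.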
Joshua Ulrich answered Sep 28 '22 02:09