I have a table with a string column formatted like this:
abcdWorkstart.csv
abcdWorkcomplete.csv
I would like to extract the last word in each filename. I think the beginning of the pattern would be the word "Work" and the end would be ".csv". I wrote something using grepl, but it is not working:
grepl("Work{*}.csv", data$filename)
Basically, I want to extract whatever is between Work and .csv.
Desired outcome:
start
complete
The substring() function in R extracts substrings from a character vector, so you can pull a required substring or character out of a given string.
grepl() stands for "grep logical". It is a built-in R function that searches a string or character vector for matches to a pattern. It takes a pattern and the data to search, and returns TRUE for each element that contains the pattern and FALSE otherwise.
grep() returns the indices of the matching elements (or the elements themselves with value = TRUE), while grepl() returns a logical vector with TRUE for a match and FALSE otherwise. Both find matches that you can then use to filter data or to decide which elements to change; neither one extracts or modifies the matched text itself.
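As a small illustration of that difference, here is a sketch using the filenames from the question. The variable names (files, pat) and the pattern itself are assumptions chosen for the example, not something from the original post.

```r
# Hypothetical example data, modelled on the filenames in the question
files <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# Illustrative pattern: "Work", then anything, then a literal ".csv" at the end
pat <- "Work.*\\.csv$"

is_match <- grepl(pat, files)              # logical vector, one entry per element
idx      <- grep(pat, files)               # integer indices of matching elements
vals     <- grep(pat, files, value = TRUE) # the matching elements themselves

print(is_match)
print(idx)
print(vals)
```

All three calls only tell you which elements match; to pull out the part between Work and .csv you still need something like sub or regmatches.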
To extract the first n characters with substr(), specify three values: the character string (x), the position of the first character to keep (1), and the position of the last character to keep (3 to get the first three characters).
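A minimal sketch of substr() on one of the filenames from the question. The position-based extraction in the second half is only an illustration that assumes the string contains "Work" and ends in ".csv"; the variable names are hypothetical.

```r
x <- "abcdWorkstart.csv"

# First three characters: positions 1 through 3
first3 <- substr(x, 1, 3)

# Position-based extraction of the text between "Work" and ".csv".
# Assumes "Work" occurs in x and that x ends with ".csv".
start <- as.integer(regexpr("Work", x, fixed = TRUE)) + nchar("Work")
stop  <- nchar(x) - nchar(".csv")
middle <- substr(x, start, stop)

print(first3)
print(middle)
```

This works for a single well-formed string, but a regex-based approach is usually easier to apply to a whole column and to guard against non-matching rows.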
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when the pattern is not found, sub returns the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
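The two steps above (substitute, then mark non-matches) could also be folded into one helper. This is only a sketch: extract_suffix is a hypothetical name, and it assumes the same pattern as in the answer.

```r
# Hypothetical helper: returns the text between "Work" and ".csv",
# or NA for elements that do not match the pattern.
extract_suffix <- function(x) {
  pat <- ".*Work(.*)\\.csv$"
  ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)
}

fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
res <- extract_suffix(fn)
print(res)
```

Testing with grepl first avoids comparing the output against the input, and the result stays the same length as the input, so it can be assigned straight back to a column.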
Here is an option using regmatches/regexpr from base R. Use regex lookarounds to match all characters that are not a . between the string 'Work' and '.csv', then extract them with regmatches. Elements without a match are dropped from the result:
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
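Because regmatches returns only the matched pieces, its result can be shorter than the input. If the output needs to line up with the original vector (for example, to store it in a new column), one possible sketch is to fill a vector of NA values at the matching positions. The names m, hit and res are assumptions made for this example.

```r
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

# regexpr returns -1 for elements with no match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)
hit <- as.integer(m) != -1L

# Start with NA everywhere, then fill in the matched substrings
res <- rep(NA_character_, length(v1))
res[hit] <- regmatches(v1, m)
print(res)
```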