
Extract substring in R using grepl

I have a table with a string column formatted like this:

abcdWorkstart.csv
abcdWorkcomplete.csv

I would like to extract the last word in each filename. I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.

grepl("Work{*}.csv", data$filename)

Basically, I want to extract whatever is between Work and .csv.

desired outcome:

start
complete
ajax2000 asked Aug 28 '18 14:08



2 Answers

I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:

fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start"           "complete"        "abcdNothing.csv"

You can work around this by filtering out the unchanged ones:

out[ out != fn ]
# [1] "start"    "complete"

Or marking them invalid with NA (or something else):

out[ out == fn ] <- NA
out
# [1] "start"    "complete" NA        
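
As a minimal sketch of the grepl versus sub distinction above, the two can also be combined: grepl tests whether the pattern is present and sub extracts the captured part only where it is. The data frame `data` and the column name `stage` are assumptions made for illustration; only `filename` comes from the question.

```r
# Hypothetical data frame mirroring the question's data$filename column
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
data <- data.frame(filename = fn, stringsAsFactors = FALSE)

# grepl only reports whether each element matches the pattern
has_work <- grepl("Work.*\\.csv$", data$filename)
has_work
# [1]  TRUE  TRUE FALSE

# sub does the extraction; non-matching rows get NA instead of the original string
data$stage <- ifelse(has_work,
                     sub(".*Work(.*)\\.csv$", "\\1", data$filename),
                     NA_character_)
data$stage
# [1] "start"    "complete" NA
```

This keeps the result the same length as the input, so it can be stored as a new column.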
r2evans answered Oct 27 '22 17:10


Here is an option using regmatches/regexpr from base R. A regex lookaround matches the characters that are not a . between the string 'Work' and '.csv', and regmatches extracts them:

regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start"    "complete"

data

v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
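
Note that the regmatches call returns two values for three inputs, because elements without a match are dropped. If the result needs to stay aligned with the input (for example, to fill a column), one possible sketch is to pre-fill a vector with NA and assign only at the matching positions. The names `m` and `out` are illustrative.

```r
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

# regexpr returns -1 for elements where the pattern is not found
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with all NA, then fill in the positions that matched
out <- rep(NA_character_, length(v1))
out[m != -1] <- regmatches(v1, m)
out
# [1] "start"    "complete" NA
```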
akrun answered Oct 27 '22 18:10