Use stringr in R to find the remaining string after last substring [duplicate]

Tags:

regex

r

stringr

How can I use str_match to extract the remainder of a string after the last occurrence of a substring?

For example, for the string "apples and oranges and bananas with cream", I'd like to extract the remainder of the string after the last occurrence of " and ", which should return "bananas with cream".

I have tried many alternatives to this command but it either keeps returning the remainder of the string after the first "and" or an empty string.

library(stringr)

str_match("apples and oranges and bananas with cream", "(?<= and ).*(?! and )")

    #     [,1]                             
    #[1,] "oranges and bananas with cream"

I've searched StackOverflow for solutions and found some for JavaScript, Python and base R, but none for the stringr package.

Thanks.

James N asked May 05 '18 01:05


2 Answers

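(I don't know about str_match, but base R regex should suffice.) Regex quantifiers are "greedy": the .+ consumes as much of the string as it can and gives characters back only as far as needed for the rest of the pattern to match, so the "and " in the pattern lines up with its last occurrence. So it's just: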

sub("^.+and ", "", "apples and oranges and bananas with cream")
#[1] "bananas with cream"
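To make the greedy behaviour concrete, here is a small sketch comparing the greedy .+ with its lazy counterpart .+?. The variable names and the use of perl = TRUE for the lazy version are my own choices for illustration, not something the question requires:

```r
s <- "apples and oranges and bananas with cream"

# Greedy: .+ grabs as much as it can, so "and " matches its LAST occurrence
greedy <- sub("^.+and ", "", s)

# Lazy: .+? grabs as little as it can, so "and " matches its FIRST occurrence
lazy <- sub("^.+?and ", "", s, perl = TRUE)

greedy  # "bananas with cream"
lazy    # "oranges and bananas with cream"
```

The lazy result is the same unwanted output as in the question, which suggests the pattern there was effectively anchoring on the first " and " rather than the last.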

I'm pretty sure there would be an equivalent in the "lubridate" corner of the hadleyverse.

Then failure with:

 library(lubridate)

Attaching package: ‘lubridate’

The following object is masked from ‘package:plyr’:

    here

The following objects are masked from ‘package:data.table’:

    hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year

The following object is masked from ‘package:base’:

    date

> str_replace("apples and oranges and bananas with cream", "^.+and ", "")
Error in str_replace("apples and oranges and bananas with cream", "^.+and ",  : 
  could not find function "str_replace"

So it's not in pkg:lubridate but rather in stringr (which as I understand it is a very light wrapper around the stringi package):

library(stringr)
 str_replace("apples and oranges and bananas with cream", "^.+and ", "")
[1] "bananas with cream"
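One caveat worth sketching: the pattern "^.+and " has no space before "and", so it can also match the tail of a longer word. The sample string below is made up for illustration; if the separator is really the standalone word, including the leading space in the pattern is probably safer:

```r
library(stringr)

# Hypothetical input containing words that merely end in "and"
x <- "sand and gravel and a band with drums"

# "and " also matches the end of "band ", so too much is removed
loose <- str_replace(x, "^.+and ", "")

# " and " (with a leading space) only matches the standalone word
strict <- str_replace(x, "^.+ and ", "")

loose   # "with drums"
strict  # "a band with drums"
```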

I do wish that people who ask questions about non-base package functions would include a library call to give respondents a clue as to their working environment.

IRTFM answered Oct 28 '22 08:10


Another simple approach is to use a variation of the "(*SKIP) what's to avoid" scheme using capture groups, i.e. What_I_want_to_avoid|(What_I_want_to_match):

library(stringr)
s  <- "apples and oranges and bananas with cream"
str_match(s, "^.+and (.*)")[,2]

The key idea here is to completely disregard the overall matches returned by the regex engine: that's the trash bin. Instead, we only need to check capture group 1 through [,2], which, when set, contains what we are looking for. See also: http://www.rexegg.com/regex-best-trick.html#pseudoregex
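A practical difference between the capture-group approach and a plain replacement shows up when a string contains no " and " at all. The sketch below uses a made-up two-element vector and adds a leading space to the separator (my assumption about what the delimiter should be):

```r
library(stringr)

# Hypothetical input: the second element has no " and " separator
v <- c("apples and oranges and bananas with cream", "just cream")

# Capture group: strings that do not match give NA
captured <- str_match(v, "^.+ and (.*)")[, 2]

# Replacement: strings that do not match are returned unchanged
replaced <- str_replace(v, "^.+ and ", "")

captured  # "bananas with cream" NA
replaced  # "bananas with cream" "just cream"
```

Which behaviour is preferable depends on whether a missing separator should count as "no result" or as "the whole string is the remainder".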

We can do a similar thing using base R gsub-functions, e.g.

gsub("^.+and (.*)", "\\1", s, perl = TRUE)

PS: Unfortunately, we cannot use the What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match pattern with stringi/stringr functions, since the ICU regex library they rely on does not include the (*SKIP)(*FAIL) verbs (these are only available in PCRE).
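For completeness, here is a rough sketch of how the (*SKIP)(*FAIL) form might look in base R, where perl = TRUE switches to the PCRE engine. The exact pattern is my own adaptation for this example string, not taken from the linked article:

```r
s <- "apples and oranges and bananas with cream"

# Everything up to and including the last " and " is matched and then
# discarded via (*SKIP)(*FAIL); the second alternative matches what is left.
m <- regexpr("^.* and (*SKIP)(*FAIL)|.+", s, perl = TRUE)
tail_part <- regmatches(s, m)

tail_part  # "bananas with cream"
```

Here regexpr() locates the first successful match and regmatches() extracts it, so no capture group is needed.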

wp78de answered Oct 28 '22 09:10