
Is there a way to selectively apply this stringr function?

I have a data frame of users, with one column containing their self-reported location. Because the locations are self-reported, some are nonsensical and can produce false positives when this column is matched against columns of known locations. Below is an example of the data frame.

data <- data.frame(X = (1:5), Y = c("", "Washington, DC", "Huntsville, AL", "Mobile,AL", "ALL OVER"))

With this data, I then run this code below to establish matches with AL.

library(stringr)
data$match_ab <- str_extract(data[,2], str_c("AL", collapse = "|"))

This correctly identifies Huntsville and Mobile as positives, but ALL OVER is also flagged as a match because of the AL inside the string.

Is there a way to adapt this script so that it detects matches within strings while ignoring strings that have letters attached to the desired part of the string? In other words, can this detect AL while there might be spaces or punctuation on either side of the partial string while ignoring the match if alphabetical letters are adjacent to the string?

Thanks in advance.

Auresm asked Dec 18 '25 19:12


2 Answers

If I understood you correctly, this should work:

data$match_ab <- str_extract(data[,2], "\\bAL\\b")

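\\b is a word boundary (written with a doubled backslash inside an R string). It matches the empty position between a word character and a non-word character, so AL is matched only when it is not directly preceded or followed by another word character. As the documentation puts it: the symbol \b matches the empty string at either edge of a word.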
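The question builds its pattern with str_c(..., collapse = "|"), which suggests a whole vector of abbreviations rather than just AL. Here is a minimal sketch of how the word-boundary idea might extend to that case. The states vector and the is_match column are made up for illustration and are not part of the original post.

```r
library(stringr)

# Sample data from the question
data <- data.frame(
  X = 1:5,
  Y = c("", "Washington, DC", "Huntsville, AL", "Mobile,AL", "ALL OVER"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of abbreviations to look for
states <- c("AL", "DC")

# Wrap the alternatives in a non-capturing group so that both \\b anchors
# apply to every abbreviation, not only to the first and last one
pattern <- str_c("\\b(?:", str_c(states, collapse = "|"), ")\\b")

data$match_ab <- str_extract(data$Y, pattern)

# Logical flag: TRUE where an abbreviation was found
data$is_match <- !is.na(data$match_ab)

data
```

Without the group, a pattern such as \\bAL|DC\\b would anchor AL only on the left and DC only on the right, so ALL OVER would match again.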

PKumar answered Dec 20 '25 13:12


A small tweak is to anchor the match to a particular position. Add $ after your search term: in a regex, $ means the term is matched only if it appears at the end of the string.

data$match_ab <- str_extract(data[,2], str_c("AL$", collapse = "|")); data;

  X              Y match_ab
1 1                    <NA>
2 2 Washington, DC     <NA>
3 3 Huntsville, AL       AL
4 4      Mobile,AL       AL
5 5       ALL OVER     <NA>

If AL can also appear in the middle of the string, a more general pattern might be:

data <- data.frame(X = (1:5), Y = c("", "Washington, DC", "Huntsville, AL, SOMETHING_AT_THE_END",
                   "Mobile,AL", "ALL OVER")); data;
  X                                    Y
1 1                                     
2 2                       Washington, DC
3 3 Huntsville, AL, SOMETHING_AT_THE_END
4 4                            Mobile,AL
5 5                             ALL OVER

data$match_ab <- str_extract(data[,2], str_c("AL(?!L)", collapse = "|")); data;
  X                                    Y match_ab
1 1                                          <NA>
2 2                       Washington, DC     <NA>
3 3 Huntsville, AL, SOMETHING_AT_THE_END       AL
4 4                            Mobile,AL       AL
5 5                             ALL OVER     <NA>

Here (?!L) is a negative lookahead: (?! ... ) means "not followed by", so AL is matched only when it is not immediately followed by L.
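One caveat: the lookahead above rules out only a following L, so AL followed or preceded by any other letter would still match. The sketch below shows one way to widen it to the question's wording (ignore the match whenever a letter is adjacent), by pairing a negative lookbehind with a negative lookahead. The test strings are made up for illustration, and the character class covers ASCII letters only.

```r
library(stringr)

# Hypothetical test strings, not taken from the question's data
y <- c("Huntsville, AL, SOMETHING_AT_THE_END", "Mobile,AL", "ALL OVER",
       "ALSO HERE", "WALKER COUNTY", "AL_north")

# Original pattern: only a following "L" blocks the match,
# so "ALSO HERE" and "WALKER COUNTY" are still picked up
not_followed_by_l <- str_extract(y, "AL(?!L)")

# Lookbehind + lookahead: any ASCII letter directly before or after
# "AL" blocks the match; spaces, punctuation, digits and "_" do not
no_adjacent_letters <- str_extract(y, "(?<![A-Za-z])AL(?![A-Za-z])")

data.frame(y, not_followed_by_l, no_adjacent_letters)
```

Unlike \\b, this version still matches AL_north, because an underscore counts as a word character but is not a letter. Which behaviour is preferable depends on the data.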

massisenergy answered Dec 20 '25 14:12


