
Extracting a specific word using gsub and regex

Tags: regex, r, gsub

Leapfrogging from a previous question, I'm having a problem with the proper regular expression syntax to isolate a specific word.

Given a data frame:

DL<-c("Dark_ark","Light-Lis","dark7","DK_dark","The_light","Lights","Lig_dark","D_Light")
Col1<-c(1,12,3,6,4,8,2,8)
DF<-data.frame(Col1)
row.names(DF)<-DL

I'm looking to extract every "Dark" and "Light" (ignoring upper vs lower case) from the row names and make a second column containing only the string "Dark" or "Light":

Col2<-c("Dark","Light","dark","dark","light","Light","dark","Light")
DF$Col2<-Col2

          Col1  Col2
Dark_ark     1  Dark
Light-Lis   12 Light
dark7        3  dark
DK_dark      6  dark
The_light    4 light
Lights       8 Light
Lig_dark     2  dark
D_Light      8 Light

I've changed the original data a bit to detail my current issue, but working off an excellent answer from Tyler Rinker, I used this:

DF$Col2<-gsub("[^dark|light]", "", row.names(DF), ignore.case = TRUE)

But the gsub gets tripped up on some of the letters the two words have in common. Searching the message boards for how to isolate an exact word with regex, it looks like the answer should be to use a double backslash with either

\\<light\\>

or

\\blight\\b

So why does the line

DF$Col2<-gsub("[^\\<dark\\>|\\<light\\>]", "", row.names(DF), ignore.case = TRUE)

not produce the desired column above? Instead I get:

          Col1    Col2
Dark_ark     1 Darkark
Light-Lis   12 LightLi
dark7        3    dark
DK_dark      6  DKdark
The_light    4 Thlight
Lights       8   Light
Lig_dark     2 Ligdark
D_Light      8  DLight

Vinterwoo asked Jul 28 '13 22:07



1 Answer

How about this?

unlist(regmatches(rownames(DF), gregexpr("dark|light", rownames(DF), ignore.case=TRUE)))
# [1] "Dark"  "Light" "dark"  "dark"  "light" "Light" "dark"  "Light"
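
One caveat with the gregexpr version: it returns every match in each name, so after unlist the result only lines up with the rows when each name contains exactly one match. A minimal sketch that keeps just the first match per name and leaves NA where there is none, assuming the same row names as in the question (the names rn, m, and col2 are only for illustration):

```r
# Row names from the question, recreated so the snippet runs on its own
rn <- c("Dark_ark", "Light-Lis", "dark7", "DK_dark",
        "The_light", "Lights", "Lig_dark", "D_Light")

# regexpr() finds only the first match in each string (-1 means no match)
m <- regexpr("dark|light", rn, ignore.case = TRUE)

# Start with NA everywhere, then fill in the strings that did match;
# regmatches() returns one substring per matching element, in order
col2 <- rep(NA_character_, length(rn))
col2[m != -1] <- regmatches(rn, m)

print(col2)
```

Because col2 always has one entry per row name, it can be assigned to DF$Col2 even if some names contain neither word.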

Or, using a capture group with gsub:

gsub(".*(dark|light).*$", "\\1", row.names(DF), ignore.case = TRUE)
# [1] "Dark"  "Light" "dark"  "dark"  "light" "Light" "dark"  "Light"
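
As for why the original attempts fail, a likely explanation is that square brackets define a set of single characters, not a list of words, so "[^dark|light]" removes every character that is not one of d, a, r, k, |, l, i, g, h, t. Word-boundary anchors only help outside the brackets, and even there "_" and digits count as word characters, so they would miss most of these names. A small sketch to illustrate both points, assuming the row names from the question (rn, stripped, and has_whole_word are illustrative names):

```r
# Row names from the question, recreated so the snippet runs on its own
rn <- c("Dark_ark", "Light-Lis", "dark7", "DK_dark",
        "The_light", "Lights", "Lig_dark", "D_Light")

# Inside [...] the pattern is a character set: this deletes any character
# that is not one of d, a, r, k, |, l, i, g, h, t (case-insensitively)
stripped <- gsub("[^dark|light]", "", rn, ignore.case = TRUE)
print(stripped)

# Outside brackets, \\b is a real word boundary, but letters, digits and "_"
# are all word characters, so "Dark_ark", "dark7" and "Lights" have no
# boundary right after the word
has_whole_word <- grepl("\\b(dark|light)\\b", rn, ignore.case = TRUE)
print(has_whole_word)
```

Only "Light-Lis" passes the word-boundary test here, which is why matching the plain alternation dark|light, as above, fits this data better than anchoring on word boundaries.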

Arun answered Nov 02 '22 23:11