Leapfrogging from a previous question, I'm having a problem with the proper regular expression syntax to isolate a specific word.
Given a data frame:
DL<-c("Dark_ark","Light-Lis","dark7","DK_dark","The_light","Lights","Lig_dark","D_Light")
Col1<-c(1,12,3,6,4,8,2,8)
DF<-data.frame(Col1)
row.names(DF)<-DL
I'm looking to extract every "Dark" and "Light" (ignoring upper vs. lower case) from the row names and make a second column containing only the string "Dark" or "Light":
Col2<-c("Dark","Light","dark","dark","light","Light","dark","Light")
DF$Col2<-Col2
          Col1  Col2
Dark_ark     1  Dark
Light-Lis   12 Light
dark7        3  dark
DK_dark      6  dark
The_light    4 light
Lights       8 Light
Lig_dark     2  dark
D_Light      8 Light
I've changed the original data a bit to detail my current issue, but working off an excellent answer from Tyler Rinker, I used this:
DF$Col2<-gsub("[^dark|light]", "", row.names(DF), ignore.case = TRUE)
But the gsub gets tripped up on some of the letters the words have in common. Searching the message boards for how to isolate an exact word with regex, it looks like the answer should be to use a double backslash with either
\\<light\\>
or
\\blight\\b
So why does the line
DF$Col2<-gsub("[^\\<dark\\>|\\<light\\>]", "", row.names(DF), ignore.case = TRUE)
not pull the desired column above? Instead I get:
          Col1    Col2
Dark_ark     1 Darkark
Light-Lis   12 LightLi
dark7        3    dark
DK_dark      6  DKdark
The_light    4 Thlight
Lights       8   Light
Lig_dark     2 Ligdark
D_Light      8  DLight
To extract words from a string vector, we can use the word() function from the stringr package. For example, if each element of a vector x is a string of space-separated words, the first 20 words of each element can be extracted with word(x, start = 1, end = 20, sep = fixed(" ")).
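As a small illustration of word(), here is a minimal sketch. It assumes the stringr package is installed, and the sentence is made-up sample data rather than anything from the question.

```r
library(stringr)

# Hypothetical sample sentence (not from the original post)
sentence <- "The quick brown fox jumps over the lazy dog"

# Words 1 through 3, splitting on a literal space
first_three <- word(sentence, start = 1, end = 3, sep = fixed(" "))

# A negative start counts from the end, so -1 is the last word
last_word <- word(sentence, start = -1)

print(first_three)  # "The quick brown"
print(last_word)    # "dog"
```

Because word() is vectorised, passing a longer character vector returns one result per element.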
When dealing with text data, we sometimes need to extract the text between two words. Those words may be close together, at opposite ends of the string, or anywhere in between. To extract the strings between two words, the str_extract_all() function from the stringr package can be used.
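One possible way to do this is with lookaround assertions, sketched below. It assumes stringr is installed; the sample string and the choice of "dark" and "light" as the two delimiting words are illustrative only.

```r
library(stringr)

# Hypothetical sample text (not from the original post)
text <- "the dark side and the light side"

# (?<=dark ) requires "dark " just before the match,
# (?= light) requires " light" just after it,
# and .*? lazily captures whatever lies in between.
between <- str_extract_all(text, "(?<=dark ).*?(?= light)")

# str_extract_all() returns a list with one character vector per input string
print(between[[1]])  # "side and the"
```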
The gsub() function in R performs replacement: it takes an input vector and substitutes every match of a pattern with the given replacement. By default gsub() treats the pattern as a regular expression (pass fixed = TRUE to match it literally), so a regular expression can be used to describe what should be replaced.
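The difference between regex and literal matching in gsub() can be seen in a short base R sketch; the input vector here is made up for illustration.

```r
# Hypothetical input: one string with a letter in the middle, one with a dot
x <- c("abc", "a.c")

# As a regular expression, "." matches any single character,
# so both elements match the pattern "a.c".
regex_result <- gsub("a.c", "X", x)

# With fixed = TRUE the pattern is taken literally,
# so only the element containing an actual dot matches.
fixed_result <- gsub("a.c", "X", x, fixed = TRUE)

print(regex_result)  # "X" "X"
print(fixed_result)  # "abc" "X"
```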
#gsub is not only slower, but it also requires an extra effort for the reader to 'decode' the arguments.
How about this?
unlist(regmatches(rownames(DF), gregexpr("dark|light", rownames(DF), ignore.case=TRUE)))
# [1] "Dark" "Light" "dark" "dark" "light" "Light" "dark" "Light"
or
gsub(".*(dark|light).*$", "\\1", row.names(DF), ignore.case = TRUE)
# [1] "Dark" "Light" "dark" "dark" "light" "Light" "dark" "Light"
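As for why the bracketed pattern in the question fails: square brackets define a character class, so "[^dark|light]" does not mean "not the word dark or light" but "any single character other than d, a, r, k, |, l, i, g, h or t". Word-boundary markers placed inside the brackets cannot act as boundaries there. The base R sketch below contrasts the two approaches on the question's row names; the variable names are only illustrative.

```r
DL <- c("Dark_ark", "Light-Lis", "dark7", "DK_dark",
        "The_light", "Lights", "Lig_dark", "D_Light")

# Character class: deletes every character that is NOT one of the listed
# letters, so stray letters such as the "ark" in "Dark_ark" survive.
stripped <- gsub("[^dark|light]", "", DL, ignore.case = TRUE)

# Alternation outside brackets matches the whole word "dark" or "light";
# regexpr() finds the first match in each string, regmatches() extracts it.
# Note: regmatches() silently drops elements that have no match at all.
hits <- regexpr("dark|light", DL, ignore.case = TRUE)
extracted <- regmatches(DL, hits)

print(stripped)
print(extracted)
```

If some row names might contain neither word, the gsub() capture-group version above keeps the result the same length as the input, which is safer when assigning to a data frame column.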