I am trying to extract characters before and after the "/" character using R.
For example, I can get the tags with the following:
s <- "hello/JJ world/NN"
# get the tags
sapply(s, function(x){gsub("([a-z].*?)/([A-z].*?)", "\\2", x)})
which returns
"JJ NN"
However, when I try to extract the characters before the "/" (the "tokens") using the following:
sapply(s, function(x){gsub("([a-z].*?)/([A-z].*?)", "\\1", x)})
I get
"helloJ worldN"
How can I get "hello world" and why is the first letter of the tag slipping in there?
The substring() function in R extracts substrings from a character vector, given start and end positions.
To extract words from a string, we can use the word() function from the stringr package. For example, if x is a string containing 100 words, the first 20 words can be extracted with word(x, start = 1, end = 20, sep = fixed(" ")).
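As a rough sketch of how substring() could be applied to the "token/TAG" strings from the question, the snippet below finds the position of the "/" with regexpr() and cuts on either side of it. The vector x and the variable names are made up for illustration, and it assumes every element contains exactly one "/".

```r
# Hypothetical input: one "token/TAG" pair per element.
x <- c("hello/JJ", "world/NN")

# Position of the first "/" in each element (fixed = TRUE: literal match).
pos <- as.integer(regexpr("/", x, fixed = TRUE))

# Everything before the slash, and everything after it.
before <- substring(x, 1, pos - 1)
after  <- substring(x, pos + 1)

print(before)
print(after)
```

This avoids regex quantifiers entirely, at the cost of needing the input split into one pair per element first.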
I think those letters remain in the output because of your regex. The [A-Z] class (it should be Z; I guess z is a typo, see "[A-Za-z] Shorthand class?") is OK, but it is followed by .*?, a lazy dot pattern that matches zero or more characters other than a newline, as few as possible. Since nothing follows it in the pattern, it matches none, so the second group captures only the first letter of the tag and the rest of the tag is left behind in the string.
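To see this for yourself, here is a small sketch that lists what the original pattern actually matches. Passing perl = TRUE is my own choice here, to make the lazy quantifier behave predictably; the question used the default engine.

```r
s <- "hello/JJ world/NN"
pattern <- "([a-z].*?)/([A-z].*?)"

# Collect every substring the pattern matches.
# perl = TRUE is an assumption of this sketch, not part of the question.
matches <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
print(matches)

# Replacing each match with group 1 leaves the unmatched tag letters behind.
leftover <- gsub(pattern, "\\1", s, perl = TRUE)
print(leftover)
```

Each match stops one letter into the tag, which is why the second letter of every tag survives the replacement.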
You need a + quantifier, which matches 1 or more characters, applied to the character class [a-zA-Z]:
s <- "hello/JJ world/NN"
sapply(s, function(x){gsub("([a-zA-Z])/[a-zA-Z]+", "\\1", x)})
I removed the second group since you are not using it.
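If you do want both pieces, a minimal sketch along the same lines keeps two capture groups and puts + on each class. The names pattern, tokens and tags are just for illustration, and it assumes tokens and tags consist of ASCII letters only.

```r
s <- "hello/JJ world/NN"

# Two capture groups; + forces each group to consume the whole
# token or tag instead of stopping after a single character.
pattern <- "([A-Za-z]+)/([A-Za-z]+)"

tokens <- gsub(pattern, "\\1", s)  # keep the part before "/"
tags   <- gsub(pattern, "\\2", s)  # keep the part after "/"

print(tokens)
print(tags)
```

Since gsub() is already vectorized over its input, the sapply() wrapper is not needed here.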