Extract characters up to "/" using R

Tags: regex, r

I am trying to extract characters before and after the "/" character using R.

For example, I can get the tags with the following:

s <- "hello/JJ world/NN"

# get the tags
sapply(s, function(x){gsub("([a-z].*?)/([A-z].*?)", "\\2", x)})

which returns

"JJ NN"

However, when I try to extract the characters before the "/" (the "tokens") using the following:

sapply(s, function(x){gsub("([a-z].*?)/([A-z].*?)", "\\1", x)})

I get

"helloJ worldN"

How can I get "hello world" and why is the first letter of the tag slipping in there?

asked Aug 02 '15 22:08

Justin Nafe

1 Answer

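I think those letters remain in the output because of your regex. The [A-z] class should be [A-Z] (I guess the lowercase z is a typo; see "[A-Za-z] Shorthand class?"). That part is otherwise OK, but it is followed by .*?, a lazy dot pattern that matches zero or more characters, as few as possible. Since nothing comes after it in the pattern, it matches nothing. The second group therefore captures only the first letter of the tag, and the rest of the tag is left behind in the string.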
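To see this for yourself, here is a small sketch that prints what the original pattern actually consumes. It uses base R's regmatches() and gregexpr() with the same string and pattern as the question; the variable names matched and leftover are only for illustration.

```r
s <- "hello/JJ world/NN"

# Substrings that the original pattern actually matches.
matched <- regmatches(s, gregexpr("([a-z].*?)/([A-z].*?)", s))[[1]]
print(matched)   # "hello/J" "world/N"

# Replacing each match with nothing shows what the pattern never touched.
leftover <- gsub("([a-z].*?)/([A-z].*?)", "", s)
print(leftover)  # "J N"
```

Each match stops after the first letter of the tag, so the second letter is never part of the match and survives any replacement.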

You need a + quantifier to match 1 or more characters and apply it to the character class [a-zA-Z]:

s <- "hello/JJ world/NN"
sapply(s, function(x){gsub("([a-zA-Z])/[a-zA-Z]+", "\\1", x)})

I removed the second group since you are not using it.
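If you need both the tokens and the tags, something along these lines might help. It is only a sketch: it assumes tokens and tags consist of ASCII letters and that pairs are separated by spaces, and the variable names are made up for illustration. Since gsub() is already vectorized over its input, the sapply() wrapper is left out here.

```r
s <- "hello/JJ world/NN"

# Tokens: keep the letters before each "/", drop the "/" and the tag.
tokens <- gsub("([a-zA-Z]+)/[a-zA-Z]+", "\\1", s)

# Tags: keep the letters after each "/", drop the token and the "/".
tags <- gsub("[a-zA-Z]+/([a-zA-Z]+)", "\\1", s)

# Alternative without back-references: split on spaces, then on "/".
pairs <- strsplit(strsplit(s, " +")[[1]], "/", fixed = TRUE)
tokens_vec <- vapply(pairs, `[`, character(1), 1)
tags_vec <- vapply(pairs, `[`, character(1), 2)

print(tokens)      # "hello world"
print(tags)        # "JJ NN"
print(tokens_vec)  # "hello" "world"
print(tags_vec)    # "JJ" "NN"
```

The strsplit() variant returns one element per token or tag, which may be more convenient than a single space-separated string if you want to process them further.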

answered Oct 16 '22 12:10

Wiktor Stribiżew
