
How to replace words between two punctuations

Tags: regex, r

I have a dataset that looks like the following

sentence <-  
    "active ingredients: avobenzone, octocrylene, octyl salicylate. 
    other stuff inactive ingredients: water, glycerin, edta."

And I am trying to get

    "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."

The logic I have in mind, in plain English, is: match anything between a punctuation mark and a colon and remove it, OR match everything between the beginning of the string and a colon and remove it. I am using gsub in R and have gotten this far:

     gsub("([:punct:][^:]*:)|^([^:]*:)", "", sentence)

but my result is this...

    [1] " avobe water, glycerin, edta."

Why does this catch everything from the middle of the first word all the way to the last colon instead of stopping at the first one? Can someone point me in the right direction to understand this logic?

Thank you!

sir_chocolate_soup asked Mar 21 '18 22:03


1 Answer

At least one way is:

    gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)
    [1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta, "

Notice the ? after .* in the pattern. It makes the match non-greedy (lazy), so .*? matches as little as possible. Without the ?, .* matches as much as possible.
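To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The string x is a made-up example, not the data from the question, and sub is used so that only the first match is replaced.

```r
# Hypothetical two-label string, chosen only to illustrate the point.
x <- "a: one. b: two."

# Greedy: .* runs as far as it can, so the match ends at the LAST colon.
greedy <- sub(".*:", "", x)

# Non-greedy: .*? stops as soon as it can, so the match ends at the FIRST colon.
lazy <- sub(".*?:", "", x)

print(greedy)  # " two."
print(lazy)    # " one. b: two."
```

The greedy version swallows the first label's content along with the second label, which is the kind of overreach the ? is there to prevent.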

Addition:

The idea is to replace everything except the part that you want with nothing. You said that you wanted to stop at punctuation marks, but you obviously did not want to stop at commas, so I took the liberty of interpreting the problem as finding the parts of the string between a colon and a period.

In my expression, .*?: matches everything up to the first colon. I put in \\s* to also cut out any blank spaces that might follow the colon. We want everything after that up to the next period, which is represented by .*?\\. BUT we want to keep that part, so I put it in parentheses to make it a 'capture group'. Because it is in parens, whatever is between the colon and the period can be referred to in the replacement as the backreference \1 (but you have to type \\1 in R to get the string \1). I also added ", " (comma-blank) after the backreference in the replacement to help separate it from whatever comes next.

So this will take active ingredients: avobenzone, octocrylene, octyl salicylate. and replace it with avobenzone, octocrylene, octyl salicylate, . Since I used gsub (global substitution), it will then continue with the rest of the string and do the same thing there, replacing other stuff inactive ingredients: water, glycerin, edta. with water, glycerin, edta, . Sorry about the ugly trailing ", ".
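If the trailing ", " is a problem, one possible follow-up (a sketch, not part of the original answer) is a second substitution that swaps the final comma and any trailing whitespace for a period. The variable names joined and cleaned are made up for illustration, and the sentence is written on a single line here for simplicity.

```r
# Same text as in the question, written on one line (an assumption made
# here to keep the example easy to read).
sentence <- "active ingredients: avobenzone, octocrylene, octyl salicylate. other stuff inactive ingredients: water, glycerin, edta."

# Step 1: keep only the text between each colon and the following period,
# appending ", " after every kept piece.
joined <- gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)

# Step 2: replace the final comma (plus trailing whitespace) with a period.
cleaned <- sub(",\\s*$", ".", joined)

print(cleaned)
# [1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."
```

The second pattern is anchored with $ so it can only touch the end of the string, leaving the commas between ingredients alone.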

G5W answered Sep 21 '22 03:09