
Regex to extract data before vs. after a comma in R

Tags: regex, r, gsub

I am a regex beginner, as I don't usually process text. I have a very simple question. I managed to construct the following regex to extract data after a comma:

sub('.*,\\s*','', X)

where X is the column I am searching.

I now separately want to extract the data before the comma, but am struggling with the regex syntax. Appreciate the help.

RichS asked Oct 12 '15 03:10


2 Answers

The following expression:

sub('\\s*,.*','', X)

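replaces everything from the first comma (and any whitespace directly before it) to the end of the line with an empty string. The regex engine takes the leftmost match, so this returns the text before the first comma in the string, not the last. If the string contains several commas and you want everything before the last one, capture greedily instead: sub('(.*),.*','\\1', X).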
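To make the first-comma versus last-comma distinction concrete, here is a small sketch. The sample vector x and the result names are made up for illustration; they are not from the question.

```r
# Hypothetical sample data: one comma, two commas, no comma
x <- c("alpha, beta", "one, two, three", "no comma here")

# Leftmost match: deletes from the FIRST comma to the end of the string
before_first <- sub("\\s*,.*", "", x)

# Greedy capture group: keeps everything up to the LAST comma
before_last <- sub("(.*),.*", "\\1", x)

print(before_first)  # "alpha" "one" "no comma here"
print(before_last)   # "alpha" "one, two" "no comma here"
```

The two calls agree whenever a string has at most one comma, and an element without a comma is returned unchanged by both.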

MrFreezer answered Nov 02 '22 08:11


Your regex

sub('.*,\\s*','', X)

is not extracting text; it is substituting the second argument for whatever the first argument matches. So any run of characters followed by a comma and then zero or more whitespace characters in X gets replaced with nothing. Because .* is greedy, that run extends up to the last comma in the string.
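The difference between substituting and extracting shows up most clearly on input without a comma. The following is only a sketch: the vector x is invented for illustration, and the lookahead pattern is one possible way to extract, which needs perl = TRUE.

```r
# Hypothetical input: one element with a comma, one without
x <- c("left, right", "no comma")

# sub() deletes the matched part and returns what remains;
# an element with no match comes back unchanged
after <- sub(".*,\\s*", "", x)

# regexpr() + regmatches() return only the matched text and
# drop elements that have no match at all
m <- regexpr("^[^,]*(?=,)", x, perl = TRUE)
extracted <- regmatches(x, m)

print(after)      # "right" "no comma"
print(extracted)  # "left"
```

Note that the extracted vector is shorter than x, so sub() is usually the more convenient choice when the result has to stay aligned with a data frame column.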

You can see what you are hitting in the demo linked above. I am not certain what you are trying to achieve, but if you want to match the text that sits before a comma, the regex below matches it with the leading .* and captures the comma and everything after it in a group. Here is how you would use it in your sub call.

In R

X2 = "here is another test string, with following text"
Y <- sub('.*(,.*)','\\1', X2)

yielding

> Y
[1] ", with following text"

In R, your code produces:

X = "here is a test string, "
Y <- sub('.*,\\s*','', X)

yielding

> Y
[1] ""
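If both halves are needed, one pattern with two capture groups can serve for each side. This is a sketch under assumptions: the string x and the names pattern, before and after are illustrative, and the pattern splits at the first comma.

```r
# Hypothetical input with a single comma
x <- "here is a test string, with following text"

# Group 1: everything before the first comma
# Group 2: everything after the comma and any whitespace following it
pattern <- "^([^,]*),\\s*(.*)$"

before <- sub(pattern, "\\1", x)
after  <- sub(pattern, "\\2", x)

print(before)  # "here is a test string"
print(after)   # "with following text"
```

Using [^,]* rather than .* for the first group pins the split to the first comma regardless of how many commas follow.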
Shawn Mehan answered Nov 02 '22 10:11