
Using gsub to extract character string before white space in R

Tags:

r

I have a list of birthdays that look something like this:

dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM") 

I want to just grab the calendar date from this variable (ie drop everything after the first occurrence of white-space).

Here's what I have tried so far:

> dob.abridged <- substring(dob,1,8)
> dob.abridged
[1] "9/9/43 1" "9/17/88 " "11/21/48"
> dob.abridged <- gsub(" $","", dob.abridged, perl=T)
> dob.abridged
[1] "9/9/43 1" "9/17/88"  "11/21/48"

So my code works for calendar dates of length 7 or 8, but not length 6 (the result for "9/9/43" still carries a trailing " 1"). Any pointers on a more effective regex to use with gsub that can handle calendar dates of length 6, 7 or 8?

Thank you.

Anupa Fabian asked Apr 09 '13 06:04


2 Answers

No need for substring, just use gsub:

gsub( " .*$", "", dob )
# [1] "9/9/43"   "9/17/88"  "11/21/48"

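A space (" "), then any character (.) any number of times (*) until the end of the string ($). Because the match starts at the first space and .* is greedy, everything from that space onward is replaced with the empty string. See ?regex to learn regular expressions.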
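As a small sketch of the same idea, the snippet below applies the pattern with sub() instead of gsub(), which should be equivalent here because the pattern is anchored at the end of the string and can match at most once per element. The second variant, using [[:space:]], is an assumption about what you might want if the separator could be a tab rather than a plain space; the names date_only and date_only_ws are just illustrative.

```r
# Sample data from the question
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# Drop the first space and everything after it.
# sub() replaces only the first match, which is all this pattern can produce.
date_only <- sub(" .*$", "", dob)

# Variant (assumption: separators may be tabs or other whitespace, not only " ")
date_only_ws <- sub("[[:space:]].*$", "", dob)

print(date_only)
print(date_only_ws)
```

On the sample data both variants should give the same three date strings, since the only whitespace present is a plain space.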

Romain Francois answered Sep 18 '22 19:09


I often use strsplit for these sorts of problems but liked how simple Romain's answer was. I thought it would be interesting to compare Romain's solution to a strsplit answer.

Here's a strsplit solution:

sapply(strsplit(dob, "\\s+"), "[", 1) 
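To unpack the one-liner above: strsplit() returns a list with one character vector per input string, and "[" with index 1 picks the first piece of each. The sketch below spells that out step by step and uses vapply() as a stricter alternative to sapply(); the variable names parts and first_piece are only illustrative.

```r
# Sample data from the question
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# Split each string on runs of whitespace; the result is a list of character vectors
parts <- strsplit(dob, "\\s+")

# Take the first piece of each element; vapply() checks that every result
# is a single character string
first_piece <- vapply(parts, function(p) p[1], character(1))

print(parts[[1]])
print(first_piece)
```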

Using the microbenchmark package and dob <- rep(dob, 1000) with the original data:

Unit: milliseconds
                                   expr       min        lq    median        uq       max neval
                  gsub(" .*$", "", dob)  4.228843  4.247969  4.258232  4.268029  5.081608  1000
 sapply(strsplit(dob, "\\s+"), "[", 1) 14.438241 14.558832 14.634638 14.756628 53.344984  1000

The clear winner on a Win 7 machine is the gsub regex from Romain, which is roughly 3.4 times faster by median (about 4.26 ms versus 14.63 ms). Thanks for the answer and explanation, Romain.
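For anyone who wants to rerun the comparison, something along these lines should work. It is a sketch rather than the exact script used: it assumes the microbenchmark package is installed (the timing step is skipped otherwise), uses the hypothetical name dob_big for the replicated vector, and sets times = 100 to keep the run short, whereas the neval column in the results suggests 1000 evaluations per expression. Absolute timings will differ by machine.

```r
# Sample data from the question, replicated 1000 times as in the benchmark
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")
dob_big <- rep(dob, 1000)

# Sanity check first: both approaches should return the same vector
same_result <- identical(
  gsub(" .*$", "", dob_big),
  sapply(strsplit(dob_big, "\\s+"), "[", 1)
)
print(same_result)

# Timing comparison; only runs if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    gsub(" .*$", "", dob_big),
    sapply(strsplit(dob_big, "\\s+"), "[", 1),
    times = 100L  # assumption: reduced from 1000 to keep the example quick
  )
  print(timings)
}
```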

Tyler Rinker answered Sep 18 '22 19:09