I have a list of birthdays that look something like this:
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")
I just want to grab the calendar date from this variable (i.e., drop everything after the first occurrence of whitespace).
Here's what I have tried so far:
> dob.abridged <- substring(dob,1,8)
> dob.abridged
[1] "9/9/43 1" "9/17/88 " "11/21/48"
> dob.abridged <- gsub(" $","", dob.abridged, perl=T)
> dob.abridged
[1] "9/9/43 1" "9/17/88" "11/21/48"
So my code works for calendar dates of length 7 or 8, but not length 6 (the first element still has a trailing " 1"). Any pointers on a more effective regex to use with gsub that can handle calendar dates of length 6, 7 or 8?
Thank you.
In R, substr() and substring() can either extract part of a character string or, when used on the left-hand side of an assignment, replace part of it. Both are vectorised, so they work directly on a character vector or a data frame column.
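As a small illustration of the extract and replace uses described above, here is a sketch in base R. The vector x is made up for the example and is not from the question.

```r
# Hypothetical example data, not from the question
x <- c("abcdef", "ghijkl")

# Extract characters 2 to 4 of every element
middle <- substr(x, 2, 4)

# substring() defaults its end to the end of the string
tail_part <- substring(x, 3)

# Replacement form: overwrite the first two characters in a copy
y <- x
substr(y, 1, 2) <- "ZZ"

print(middle)
print(tail_part)
print(y)
```

Fixed positions like these only help when every element has the same layout, which is why they fall short for dates of varying length.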
To extract words from a string, we can use the word function of the stringr package. For example, if x is a string that contains 100 words, the first 20 words can be extracted with word(x, start = 1, end = 20, sep = fixed(" ")).
To match a regex special character literally in a function such as gsub, put two backslashes (i.e. \\) in front of it in the R string, for example "\\." for a literal dot.
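A minimal sketch of that escaping rule, using made-up strings (prices is a hypothetical vector, not part of the question). It also shows fixed = TRUE as an alternative that avoids escaping altogether.

```r
# Hypothetical example strings, not from the question
prices <- c("cost: $5.00", "cost: $12.50")

# "\\$" is the regex \$ (a literal dollar sign);
# unescaped, $ would anchor the end of the string instead
no_dollar <- gsub("\\$", "", prices)

# "\\." is a literal dot; an unescaped "." would match every character
no_dot <- gsub("\\.", "", prices)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
no_dot_fixed <- gsub(".", "", prices, fixed = TRUE)

print(no_dollar)
print(no_dot)
```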
The gsub() function in R performs replacement: it takes a pattern, a replacement and an input vector, and substitutes every match of the pattern in each element. By default the pattern is interpreted as a regular expression; pass fixed = TRUE to match it as a literal string. Its sibling sub() replaces only the first match.
No need for substring, just use gsub:
gsub( " .*$", "", dob ) # [1] "9/9/43" "9/17/88" "11/21/48"
A space ( ), then any character (.) any number of times (*) until the end of the string ($). See ?regex to learn regular expressions.
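Two small variations on the pattern above, offered as a sketch rather than a correction. Since each string needs only one replacement, sub() is enough, and "\\s" can stand in for the literal space if the separator might be a tab. The names date_only and date_only_ws are introduced here for illustration.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# sub() replaces only the first match; because .* runs to the end of
# the string, one match already removes everything after the date
date_only <- sub(" .*$", "", dob)

# "\\s" matches any whitespace character (space, tab, ...),
# not just a literal space
date_only_ws <- sub("\\s.*$", "", dob)

print(date_only)
print(date_only_ws)
```

Both calls give the same result as the gsub version for this data.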
I often use strsplit for these sorts of problems, but I liked how simple Romain's answer was, so I thought it would be interesting to compare Romain's solution to a strsplit answer.
Here's a strsplit solution:
sapply(strsplit(dob, "\\s+"), "[", 1)
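To make the one-liner above easier to follow, here is the same idea split into steps. This is a sketch: the intermediate names parts and first_piece are introduced for illustration, and vapply is swapped in for sapply only because it checks that each result is a single string.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# strsplit returns a list with one character vector per input element,
# split wherever one or more whitespace characters occur
parts <- strsplit(dob, "\\s+")
print(parts[[1]])

# `[` is the subsetting function; the trailing 1 is passed to it as the
# index, so each list element is reduced to its first piece
first_piece <- vapply(parts, `[`, character(1), 1)

print(first_piece)
```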
Using the microbenchmark package and dob <- rep(dob, 1000) with the original data:
Unit: milliseconds
                                  expr       min        lq    median        uq       max neval
                 gsub(" .*$", "", dob)  4.228843  4.247969  4.258232  4.268029  5.081608  1000
 sapply(strsplit(dob, "\\s+"), "[", 1) 14.438241 14.558832 14.634638 14.756628 53.344984  1000
The clear winner on a Win 7 machine is the gsub regex from Romain. Thanks for the answer and explanation Romain.
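For anyone who wants to rerun the comparison, a sketch along these lines should work. It assumes the microbenchmark package may or may not be installed and falls back to base system.time if it is not. The repetition count of 100 is an arbitrary choice for this sketch, and timings will differ by machine and R version.

```r
dob <- rep(c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM",
             "11/21/48 12:00 AM/PM"), 1000)

# Confirm both approaches agree before timing them
res_gsub  <- gsub(" .*$", "", dob)
res_split <- sapply(strsplit(dob, "\\s+"), "[", 1)
stopifnot(identical(res_gsub, res_split))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # times = 100 is an assumption made to keep the run short
  print(microbenchmark::microbenchmark(
    gsub(" .*$", "", dob),
    sapply(strsplit(dob, "\\s+"), "[", 1),
    times = 100
  ))
} else {
  # Base R fallback: total time for 100 repetitions of each expression
  print(system.time(for (i in 1:100) gsub(" .*$", "", dob)))
  print(system.time(for (i in 1:100) sapply(strsplit(dob, "\\s+"), "[", 1)))
}
```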