
split string before next to last character

Tags: regex, r

I have a numeric variable, DATE, that encodes dates: the last two characters are the month and the first one or two characters are the day. I would like to split it into separate MONTH and DAY columns.

I can do this with the following R code, but I was hoping for a simpler regex solution.

my.data <- read.table(text = '
     ID     DATE     VARX
    A111     104        0
    A111     204        1
    A111    1004        4
    A111    2004        4
    B111    3004        2
    C111    3004        3
    C111     105        4
    C111    1005        4
', header = TRUE, stringsAsFactors = FALSE)

# remove the last two characters of a string
my.data$DAY   <- ifelse(nchar(my.data$DATE) == 3,
                        substr(my.data$DATE, nchar(my.data$DATE) - (nchar(my.data$DATE)-1), nchar(my.data$DATE) - (nchar(my.data$DATE)-1)),
                        substr(my.data$DATE, nchar(my.data$DATE) - (nchar(my.data$DATE)-1), nchar(my.data$DATE) - (nchar(my.data$DATE)-2)))

# keep the last two characters of a string

my.data$MONTH <- substr(my.data$DATE, (nchar(my.data$DATE)-1), nchar(my.data$DATE))

    ID DATE VARX DAY MONTH
1 A111  104    0   1    04
2 A111  204    1   2    04
3 A111 1004    4  10    04
4 A111 2004    4  20    04
5 B111 3004    2  30    04
6 C111 3004    3  30    04
7 C111  105    4   1    05
8 C111 1005    4  10    05

Thank you for any suggestions.

asked by Mark Miller, Dec 20 '22 12:12


1 Answer

Here are a few alternatives. The first is the most concise. The first two only use base R.

1) numeric manipulation

transform(my.data, MONTH = DATE %% 100, DAY = DATE %/% 100)

giving:

    ID DATE VARX MONTH DAY
1 A111  104    0     4   1
2 A111  204    1     4   2
3 A111 1004    4     4  10
4 A111 2004    4     4  20
5 B111 3004    2     4  30
6 C111 3004    3     4  30
7 C111  105    4     5   1
8 C111 1005    4     5  10
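As a quick check of the arithmetic in (1): `%%` is the remainder operator and `%/%` is integer division, so dividing by 100 peels off the last two digits. The sketch below assumes DATE is stored as a number in day-then-two-digit-month form; the vector `dates` is only an illustration built from the question's sample values.

```r
# Illustrative values taken from the DATE column in the question
dates <- c(104, 204, 1004, 3004, 105)

month <- dates %% 100   # remainder after dividing by 100: the last two digits
day   <- dates %/% 100  # integer quotient: everything before the last two digits

data.frame(DATE = dates, DAY = day, MONTH = month)
```

Because the result is numeric, the month's leading zero is not kept (04 becomes 4).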

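2) sub. The pattern splits each value into everything before the last two characters (group 1) and the last two characters (group 2). This gives the same result as in (1).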

spl <- function(x, replace) as.numeric(sub("(.*)(..)", replace, x))
transform(my.data, MONTH = spl(DATE, "\\2"), DAY = spl(DATE, "\\1"))
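To see what the two capture groups in (2) return before as.numeric is applied, here is a small sketch on a character vector. The vector `x` and the names `day_chr` and `month_chr` are made up for illustration. The `.*` is greedy but has to backtrack far enough to leave two characters for `(..)`.

```r
# Made-up sample values in the same day + two-digit-month layout
x <- c("104", "1004", "3005")

day_chr   <- sub("(.*)(..)", "\\1", x)  # keep group 1: all but the last two characters
month_chr <- sub("(.*)(..)", "\\2", x)  # keep group 2: the last two characters

day_chr
month_chr
```

At this stage the results are still character strings, so the month keeps its leading zero until it is converted with as.numeric.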

3) strapply. From the gsubfn package, strapply applies as.numeric to the part of the match in parentheses and returns it. This gives the same result as in (1).

library(gsubfn)

spl <- function(x, rx) strapply(x, rx, as.numeric, simplify = TRUE)
transform(my.data, MONTH = spl(DATE, ".*(..)"), DAY = spl(DATE, "(.*).."))

Note: All three return numeric columns, which seems preferable. If you want character columns instead, add as.character(...) or an appropriate sprintf in (1), omit as.numeric in (2), or replace as.numeric with c in (3).
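For instance, a zero-padded character MONTH like the one in the question's desired output could be produced from approach (1) along these lines. This is only a sketch: the two-row data frame, the `%02d` format, and the name `res` are assumptions for illustration, not part of the code above.

```r
# Small stand-in for my.data, with DATE as integer (as read.table would read it)
my.data <- data.frame(ID = c("A111", "C111"), DATE = c(104L, 1005L))

res <- transform(my.data,
                 MONTH = sprintf("%02d", DATE %% 100L),  # zero-padded to width 2
                 DAY   = as.character(DATE %/% 100L))    # plain character, no padding
res
```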

Update: Added (2) and (3) and made some improvements.

answered by G. Grothendieck, Jan 07 '23 02:01