
Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:

set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))

Desired results

The desired result should look like this:

          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078

Characteristics

  1. Indicator descriptions are in one column
  2. Numeric values (everything from the first digit onward, including that digit) are in the second column

Code

require(dplyr); require(tidyr); require(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)

Naturally this does not work:

> head(dta, 2)
  indicator period    values
1              001 0.2655087
2              011 0.3721239
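To see why the output looks like this, it can help to check what the pattern actually matches. The sketch below uses base R with perl = TRUE as a stand-in for the regex engine inside separate() (an assumption; the helper names consumed and remainder are made up for illustration). Whatever the pattern matches is treated as the separator and thrown away, so the description and the leading 2 both disappear.

```r
x <- c("someindicator2001", "some text 20022008")

# The text matched by the pattern is what separate() would treat as the
# separator, so it is removed from the result.
consumed <- regmatches(x, regexpr("^[^\\d]*(2+)", x, perl = TRUE))
consumed
# [1] "someindicator2" "some text 2"

# Only what follows the separator survives, and it lands in `period`.
remainder <- sub("^[^\\d]*(2+)", "", x, perl = TRUE)
remainder
# [1] "001"     "0022008"
```

The separator therefore has to match only the boundary between text and digits, not the text itself.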

Other attempts

  • I have also tried the default separator, sep = "[^[:alnum:]]", but it splits the column into too many pieces, because it matches every non-alphanumeric character (here, each space) rather than the first digit.
  • The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).

What I'm trying to do boils down to:

  • Identifying the first digit in the string
  • Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
Konrad asked Jan 17 '16 19:01


2 Answers

I think this might do it.

library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
#           indicator   period    values
# 1     someindicator     2001 0.2655087
# 2     someindicator     2011 0.3721239
# 3         some text 20022008 0.5728534
# 4 another indicator     2003 0.9082078

The following is an explanation of the regular expression, brought to you by regex101.

  • (?<=[a-z]) is a positive lookbehind: it asserts that the current position is immediately preceded by a single character in the range a to z (case sensitive), without consuming it
  • " ?" (a space followed by ?) matches a literal space character zero or one time, preferring one when it is present
  • (?=[0-9]) is a positive lookahead: it asserts that the current position is immediately followed by a single character in the range 0 to 9, without consuming it
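The three parts above can be checked outside tidyr. This is a minimal base R sketch, assuming perl = TRUE behaves like the engine separate() uses; the variable names pos, len, indicator and period are chosen here for illustration. It locates the separator and shows that it is either zero characters wide (letters directly followed by digits) or exactly one space wide.

```r
x <- c("someindicator2001", "some text 20022008", "another indicator 2003")

# Locate the boundary between the last lowercase letter and the first digit.
m   <- regexpr("(?<=[a-z]) ?(?=[0-9])", x, perl = TRUE)
pos <- as.integer(m)               # where the separator starts
len <- attr(m, "match.length")     # 0 = no space, 1 = one space consumed

pos
# [1] 14 10 18
len
# [1] 0 1 1

# Cut each string around the separator.
indicator <- substr(x, 1, pos - 1)
period    <- substr(x, pos + len, nchar(x))

indicator
# [1] "someindicator"     "some text"         "another indicator"
period
# [1] "2001"     "20022008" "2003"
```

Because the lookarounds consume nothing, the first digit is preserved in period, which is what the question asked for.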
Rich Scriven answered Nov 11 '22 18:11


You could also use unglue::unglue_unnest():

dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#>       values         indicator   period
#> 1 0.43234262     someindicator     2001
#> 2 0.65890900     someindicator     2011
#> 3 0.93576805         some text 20022008
#> 4 0.01934736 another indicator     2003

Created on 2019-09-14 by the reprex package (v0.3.0)
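For readers without unglue installed, the pattern can be read roughly as an ordinary regular expression with two capture groups. The translation below is my own reading of the pattern, not unglue's documented internals, so treat it as an assumption: a lazy group for the description, optional whitespace, then a group of digits anchored at the end of the string.

```r
x <- c("someindicator2001", "some text 20022008", "another indicator 2003")

# Assumed regex reading of "{indicator}{=\\s*}{period=\\d*}":
# lazy text, optional whitespace, trailing digits.
pattern <- "^(.*?)\\s*(\\d*)$"

indicator <- sub(pattern, "\\1", x, perl = TRUE)
period    <- sub(pattern, "\\2", x, perl = TRUE)

data.frame(indicator, period)
```

The lazy (.*?) matters here: a greedy (.*) would swallow the digits and leave period empty.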

Moody_Mudskipper answered Nov 11 '22 16:11