tidyr separate column values into character and numeric using regex

Tags:

regex

r

tidyr

I'd like to split column values into a name part and a numeric part using tidyr::separate and a regular expression, but I'm new to regex.

df <- data.frame(A=c("enc0","enc10","enc25","enc100","harab0","harab25","harab100","requi0","requi25","requi100"), stringsAsFactors=FALSE)

This is what I've tried

library(tidyr)
df %>%
   separate(A, c("name","value"), sep="[a-z]+")

Bad Output

   name value
1           0
2          10
3          25
4         100
5           0
# etc

How do I save the name column as well?

CPak asked Aug 09 '17 12:08


3 Answers

You may use a (?<=[a-z])(?=[0-9]) lookaround based regex with tidyr::separate:

> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The (?<=[a-z])(?=[0-9]) pattern matches a location in the string right between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires its pattern to be present immediately to the left of the current location, and (?=...) is a positive lookahead that requires its pattern to be present immediately to the right of the current location. Because both are zero-width assertions, no characters are consumed, so the letters and digits are kept intact when splitting.
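
To see that the lookaround pattern matches a position rather than any characters, one can inspect the match with base R's regexpr(). This is only a minimal sketch: the sample vector and variable names are illustrative, and perl = TRUE is passed because base R needs it for lookarounds.

```r
# Illustrative sketch: inspect the zero-width match in base R.
x <- c("enc0", "harab25", "requi100")
pat <- "(?<=[a-z])(?=[0-9])"

# perl = TRUE is required for lookbehind/lookahead in base R regex functions
m <- regexpr(pat, x, perl = TRUE)

pos <- as.integer(m)             # index of the first digit in each string
len <- attr(m, "match.length")   # 0 everywhere: no characters are consumed

# Split at the matched position without dropping any characters
name  <- substr(x, 1L, pos - 1L)
value <- substr(x, pos, nchar(x))
```

Because the match length is zero, everything to the left of the position goes to name and everything from the position onward goes to value.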

Alternatively, you may use extract:

extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")

Output:

    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The ^([a-z]+)(\\d+)$ pattern matches:

  • ^ - start of string
  • ([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
  • (\\d+) - Capturing group 2 (column value): one or more digits
  • $ - end of string.
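
Since the goal is one character column and one numeric column, it may be worth noting that extract() has a convert argument. The sketch below uses a shortened copy of the question's data (an assumption made for brevity) and sets convert = TRUE so that value comes back as an integer rather than a character string.

```r
library(tidyr)

# Shortened copy of the question's data, for illustration only
df <- data.frame(A = c("enc0", "enc10", "harab25", "requi100"),
                 stringsAsFactors = FALSE)

# convert = TRUE runs type.convert() on the new columns,
# so the digits-only "value" column becomes integer
res <- tidyr::extract(df, A, into = c("name", "value"),
                      regex = "^([a-z]+)(\\d+)$", convert = TRUE)

str(res)
```

Without convert = TRUE both new columns are character, and value would need an explicit as.integer() or as.numeric() afterwards.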
Wiktor Stribiżew answered Oct 13 '22 09:10


For a base R version without a lookaround-based regex, define the regular expression first:

> re <- "[a-zA-Z][0-9]"

Then use two substr() commands to separate and return the desired two components, before and after the matched pattern.

> with(df,
      data.frame(name=substr(A, 1L, regexpr(re, A)), 
                 value=substr(A, regexpr(re, A) + 1L, 1000L))
      )
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

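The regex here matches a letter ([a-zA-Z]) immediately followed by a digit ([0-9]). regexpr() returns the position of that letter, which is where the name ends, so the value starts one character later. I believe this is what base R's reshape() function does when its sep argument is specified as "".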
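
As a small check of that remark, the sketch below builds a made-up wide data frame (the id column and the cell values are invented for illustration) and lets reshape() guess the variable name and the time from column names such as enc0 and enc10 by passing sep = "".

```r
# Hypothetical wide data: column names follow the letters+digits pattern
wide <- data.frame(id = 1:2, enc0 = c(1, 2), enc10 = c(3, 4))

# With sep = "", reshape() splits each varying name just before the first
# digit that follows a letter: "enc0" gives variable "enc" and time 0
long <- reshape(wide, direction = "long",
                varying = c("enc0", "enc10"), sep = "", idvar = "id")

long
```

The long result has an enc column for the values and a time column holding 0 and 10, which is the same letters/digits split as the substr() approach above.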

Edward answered Oct 13 '22 09:10


If you really want to do it with separate (though I don't see the point), you can add one more step: insert a delimiter between the letters and the digits first, then split on it (using the same regex as @WiktorStribiżew):

library(dplyr)  # for mutate()
library(tidyr)  # for separate()
df %>% 
  mutate(A = gsub('^([a-z]+)(\\d+)$', '\\1_\\2', A)) %>% 
  separate(A, into = c('name', 'value'), sep = '_')
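
The intermediate step can be checked on its own. This base R sketch (sample vector and variable names are illustrative) shows the strings that gsub() hands to separate, with an underscore inserted between the two captured groups.

```r
x <- c("enc0", "harab25", "requi100")

# \\1 and \\2 refer to the letters and the digits captured by the two groups
with_sep <- gsub("^([a-z]+)(\\d+)$", "\\1_\\2", x)

# Splitting on the inserted underscore recovers both parts
parts <- strsplit(with_sep, "_", fixed = TRUE)
```

This assumes the original values contain no underscore of their own; if they might, a different delimiter would be needed.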
Sotos answered Oct 13 '22 09:10