tidyr separate column values into character and numeric using regex

I'd like to split column values into a name and a number using tidyr::separate and a regular expression, but I'm new to regex.

df <- data.frame(A = c("enc0","enc10","enc25","enc100","harab0","harab25","harab100","requi0","requi25","requi100"), stringsAsFactors = FALSE)

This is what I've tried:

library(tidyr)
df %>%
   separate(A, c("name","value"), sep="[a-z]+")

Bad Output

   name value
1           0
2          10
3          25
4         100
5           0
# etc

The letters are consumed by the separator, so the name column comes out empty. How do I keep the name as well?

asked Aug 09 '17 12:08

CPak

3 Answers

You may use a (?<=[a-z])(?=[0-9]) lookaround based regex with tidyr::separate:

> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The (?<=[a-z])(?=[0-9]) pattern matches a location in the string right in between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires the presence of its pattern immediately to the left of the current location, and (?=...) is a positive lookahead that requires the presence of its pattern immediately to the right of the current location. Neither construct consumes any characters, so the letters and digits are kept intact when splitting.
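To see that the lookaround pattern really matches a position rather than characters, here is a minimal base R sketch. It assumes a shortened sample of the question's data, and the variable names (x, split_pos, match_len) are only illustrative.

```r
# Sample strings taken from the question's data.
x <- c("enc0", "harab25", "requi100")
pattern <- "(?<=[a-z])(?=[0-9])"

# perl = TRUE is required: lookarounds are a PCRE feature.
m <- regexpr(pattern, x, perl = TRUE)
split_pos <- as.integer(m)            # index of the first digit in each string
match_len <- attr(m, "match.length")  # 0 for every string: nothing is consumed

# Everything before the split point is the name, the rest is the value.
name  <- substr(x, 1L, split_pos - 1L)
value <- substr(x, split_pos, nchar(x))
print(data.frame(name, value))
```

Because the match has length zero, no character is dropped on either side of the split, which is the property separate relies on here.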

Alternatively, you may use extract:

extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")

Output:

    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The ^([a-z]+)(\\d+)$ pattern matches:

^ - start of string
([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
(\\d+) - Capturing group 2 (column value): one or more digits
$ - end of string.
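Since the goal is a character column and a numeric one, it may also help to pass convert = TRUE, an existing argument of extract (and of separate) that runs type conversion on the new columns. A small sketch, assuming tidyr is installed and using a shortened copy of the question's data:

```r
library(tidyr)

# Shortened copy of the question's data frame.
df <- data.frame(A = c("enc0", "harab25", "requi100"),
                 stringsAsFactors = FALSE)

# Same regex as above; convert = TRUE turns the digit strings into integers.
out <- extract(df, A, into = c("name", "value"),
               regex = "^([a-z]+)(\\d+)$", convert = TRUE)
str(out)
```

Without convert = TRUE both columns stay character, and value would need an explicit as.integer() afterwards.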

answered Oct 13 '22 09:10

Wiktor Stribiżew

For a base R version without a lookaround-based regex, define the regular expression first:

> re <- "[a-zA-Z][0-9]"

Then use two substr() commands to separate and return the desired two components, before and after the matched pattern.

> with(df,
      data.frame(name=substr(A, 1L, regexpr(re, A)), 
                 value=substr(A, regexpr(re, A) + 1L, 1000L))
      )
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

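The regex here looks for the pattern "any alpha" [a-zA-Z] immediately followed by "any numeric" [0-9]. regexpr() returns the position where that match starts, i.e. the last letter of the name, which is why the first substr() ends there and the second one starts one character later. I believe this is what reshape() does when its sep argument is specified as "".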
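A small sketch of what regexpr() returns for this pattern may make the two substr() calls easier to follow. It uses a shortened sample of the question's data, swaps the hard-coded 1000L for nchar(A), and converts the value to integer; those last two choices are my own variation, not part of the answer above.

```r
# Sample strings taken from the question's data.
A <- c("enc0", "harab25", "requi100")
re <- "[a-zA-Z][0-9]"

# Start of the "letter followed by digit" match = index of the last letter.
pos <- as.integer(regexpr(re, A))

name  <- substr(A, 1L, pos)
# nchar(A) as the stop position avoids relying on an arbitrary upper bound.
value <- as.integer(substr(A, pos + 1L, nchar(A)))

print(data.frame(name, value))
```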

answered Oct 13 '22 09:10

Edward

If you really want to do it with separate (though I don't see the point), you can add one more step: first insert a delimiter between the letters and the digits (using the same regex as @WiktorStribiżew), then split on it:

library(dplyr)  # mutate()
library(tidyr)  # separate()

df %>% 
  mutate(A = gsub('^([a-z]+)(\\d+)$', '\\1_\\2', A)) %>% 
  separate(A, into = c('name', 'value'), sep = '_')
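To make the intermediate step visible, here is a base R sketch of what the gsub() call produces before the split. The sample vector is a shortened copy of the question's data, and the names tagged and parts are only illustrative.

```r
# Sample strings taken from the question's data.
A <- c("enc0", "harab25", "requi100")

# \\1 and \\2 refer back to the two capture groups; "_" is put between them.
tagged <- gsub("^([a-z]+)(\\d+)$", "\\1_\\2", A)
print(tagged)

# Splitting on the literal "_" then gives one row per input string.
parts <- do.call(rbind, strsplit(tagged, "_", fixed = TRUE))
colnames(parts) <- c("name", "value")
print(parts)
```

This only works as long as the chosen delimiter never occurs in the original values; strings that do not match the regex are left unchanged by gsub().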

answered Oct 13 '22 09:10

Sotos

Tags: regex, r, tidyr
