
Split camelCase Column names

Tags: regex, r, dplyr, tidyr

I've been trying to figure this out for a while, and thought I would ask here.

Say I have a data frame like the following:

df <- data.frame(participant = 1:6, group = c("adult", "adult", "child", "child", "NSS", "NSS"), RegProto = c(2, 3, 4, 2, 4, 3), RegInt = c(2, 3, 4, 6, 6, 5), RegDistant = c(3, 3, 4, 5, 4, 5), IrregProto = c(4, 5, 3, 4, 3, 1), IrregInt = c(4, 4, 4, 4, 4, 4), IrregDistant = c(4, 5, 6, 8, 9, 1))

The problem with this data frame is that each of the last six column names encodes two variables: one whose values are either Reg or Irreg, and another whose values are Proto, Int, or Distant. What I would like to do is split these column names and make the table long, preferably using tidyr. I thought I could do it like this:

library("tidyr")
df_long <- df %>%
gather(index, n, -group, -participant) %>%
select(participant, group, index, n) %>%
separate(index, into = c("verb", "similarity"), sep = "\\.?=\\p{Upper}")

This does what I want until separate(). I get an error message saying that the values were not split, but no other suggestions as to why that might be. I'm new to regex, so I suspect the problem must be there, but I can't figure out what the correct syntax might be.

asked Jan 19 '15 15:01 by JoeF


1 Answer

You can use this regex:

(?<=.)(?=[A-Z])

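This matches the zero-length position that is preceded by any character (the lookbehind (?<=.)) and followed by an uppercase letter (the lookahead (?=[A-Z])). Because the match consumes no characters, nothing is lost when splitting, and the lookbehind keeps the pattern from matching at the very start of the string, before the leading capital.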
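To see where the pattern matches, here is a minimal base R sketch that needs no packages. It assumes the column names from the question and uses sub() and strsplit() with perl = TRUE only to illustrate the match position; it is not what separate() does internally. The names cols, pattern, marked, and parts are made up for this example.

```r
# Column names taken from the question's data frame.
cols <- c("RegProto", "RegInt", "RegDistant",
          "IrregProto", "IrregInt", "IrregDistant")

# Zero-width split point: any character behind, an uppercase letter ahead.
pattern <- "(?<=.)(?=[A-Z])"

# Insert "|" at the first matched position to make it visible.
marked <- sub(pattern, "|", cols, perl = TRUE)
print(marked)

# Split each name at that position; every name yields two pieces.
parts <- strsplit(cols, pattern, perl = TRUE)
print(do.call(rbind, parts))
```

The marker lands between the lowercase prefix and the capital that starts the second word, which is the same boundary separate() uses when given this pattern as sep.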

The command:

library(tidyr)  # gather() and separate()
library(dplyr)  # select() and %>%
df %>%
  gather(index, n, -group, -participant) %>%
  select(participant, group, index, n) %>%
  separate(index, into = c("verb", "similarity"), sep = "(?<=.)(?=[A-Z])")

answered Sep 21 '22 11:09 by Sven Hohenstein