Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split column name and convert data from wide to long format in R

Tags:

r

reshape

I have a large dataset that I need to convert to long format from wide format. That should be simple enough and there are lots of examples of how to do that on this forum. However, in this case, I also need to split the column headers used in the wide format and create a column for each of them in the long format.

Example dataset

 data <- data.frame("East2010"=1:3, "West2010"=4:6, "East2011"=7:9, "West2011"=5:7)
 data
 East.2010 West.2010 East.2011 West.2011
 1         1         4         7         5
 2         2         5         8         6
 3         3         6         9         7

What I want is something like this

 Site   Year   Response
 East   2010   1
 East   2010   2
 East   2010   3
 West   2010   4
 West   2010   5
 West   2010   6
 East   2011   7
 East   2011   8
 East   2011   9
 West   2011   5
 West   2011   6
 West   2011   7

I have looked a lot of examples on this forum that will melt data to convert to long format and others that do a column split at a delimiter, but I have not been able to make the two work together.

like image 699
user3799924 Avatar asked Aug 12 '14 18:08

user3799924


People also ask

How do I convert wide data to long data in R?

The easiest way to reshape data between these formats is to use the following two functions from the tidyr package in R: pivot_longer(): Reshapes a data frame from wide to long format. pivot_wider(): Reshapes a data frame from long to wide format.

How do I split a column into values in R?

To split a column into multiple columns in the R Language, we use the separator() function of the dplyr package library. The separate() function separates a character column into multiple columns with a regular expression or numeric locations.

What is long format and wide format in R?

A dataset can be written in two different formats: wide and long. A wide format contains values that do not repeat in the first column. A long format contains values that do repeat in the first column.

What is long format in R?

When there are multiple measurements of the same subject, across time or using different tools, the data is often described as being in "wide" format if there is one observation row per subject with each measurement present as a different variable and "long" format if there is one observation row per measurement (thus, ...


3 Answers

Here's the "modern day" :-) approach for this:

library(dplyr)
library(tidyr)
data %>%
  gather(var, Response, East2010:West2011) %>%  ## Makes wide data long
  separate(var, c("Site", "Year"), sep = -5)    ## Splits up a column
#    Site Year Response
# 1  East 2010        1
# 2  East 2010        2
# 3  East 2010        3
# 4  West 2010        4
# 5  West 2010        5
# 6  West 2010        6
# 7  East 2011        7
# 8  East 2011        8
# 9  East 2011        9
# 10 West 2011        5
# 11 West 2011        6
# 12 West 2011        7

The sep = -5 from above says to go from the end of the string backwards five characters and split there. Thus, if you had "North2010" as a possible name, this would still work.

That said, it is more reliable to use a regular expression like @David's, which is also possible within separate:

data %>%
  gather(var, Response, East2010:West2011) %>%
  separate(var, c("Site", "Year"), 
           sep = "(?<=[[:alpha:]])(?=[[:digit:]])", 
           perl = TRUE)
like image 87
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 20 '22 23:10

A5C1D2H2I1M1N2O1R2T1


Or (in case the column width is not always constant). Here I use a "lookahead" and "lookbehind" in order to separate characters from digits.

library(reshape2)
data <- melt(data)
temp <- strsplit(as.character(data$variable), "(?<=[[:alpha:]])(?=[[:digit:]])", perl = TRUE)
transform(data, Site = sapply(temp, "[", 1), Year = sapply(temp, "[", 2))

#   variable value Site Year
#1  East2010     1 East 2010
#2  East2010     2 East 2010
#3  East2010     3 East 2010
#4  West2010     4 West 2010
#5  West2010     5 West 2010
#6  West2010     6 West 2010
#7  East2011     7 East 2011
#8  East2011     8 East 2011
#9  East2011     9 East 2011
#10 West2011     5 West 2011
#11 West2011     6 West 2011
#12 West2011     7 West 2011
like image 3
David Arenburg Avatar answered Oct 20 '22 23:10

David Arenburg


Something along these lines would work:

library("plyr")
library("reshape2")
m.data <- melt(data)
m.data <- mutate(m.data, Site=substr(variable, 1,4), 
    Year=substr(variable, 5,8))

Which would result in:

> m.data
   variable value Site Year
1  East2010     1 East 2010
2  East2010     2 East 2010
3  East2010     3 East 2010
4  West2010     4 West 2010
5  West2010     5 West 2010
6  West2010     6 West 2010
7  East2011     7 East 2011
8  East2011     8 East 2011
9  East2011     9 East 2011
10 West2011     5 West 2011
11 West2011     6 West 2011
12 West2011     7 West 2011
like image 2
landroni Avatar answered Oct 20 '22 22:10

landroni