I've been provided with a couple of very awkwardly formatted data in excel which I need to reshape so it fits to run a survival analysis in R.
I uploaded an extract of the data to Google drive: https://drive.google.com/open?id=1ret3bCDCYPDALQ16YBloaeopfl2-qVbp
The original data frame has about 2100 observations and 950 variables
Here is the basic data frame:
my.data<-data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1)
)
my.data
# ID LR LR.1 LR.2
# 1 2012Y 2012Y 2011Y
# 2 State:FL State:AZ State:FL
# 3 C8477 5 5 7
# 4 C5273 6 8 2
# 5 C5566 8 10 1
All the columns have the same name "LR". I don't know if this will be a problem later...
The Year is given in row 1 and the according state the observation happened in row 2.
As output I need to have some panel data that I work with in later survival analysis.
my.data<-data.frame(
ID=c("C8477","C5273","C5566"),
Year=c("2012","2012","2011"),
State=c("FL","AZ","FL"),LR=c(5,8,1)
)
my.data
# ID Year State LR
# 1 C8477 2012 FL 5
# 2 C5273 2012 AZ 8
# 3 C5566 2011 FL 1
I played around with the reshape function and seq functions, but non of that will help me move in the right direction, as the data frame is so oddly arranged.
Melting in R It is performed using melt() function which takes dataset and column values that has to be kept constant. Using melt(), dataframe is converted into long format and stretches the data frame.
The easiest way to reshape data between these formats is to use the following two functions from the tidyr package in R: pivot_longer(): Reshapes a data frame from wide to long format. pivot_wider(): Reshapes a data frame from long to wide format.
Description. This function reshapes a data frame between 'wide' format with repeated measurements in separate columns of the same record and 'long' format with the repeated measurements in separate records.
This is a tidyverse
approach:
my.data <- data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1)
)
my code:
library(tidyverse)
year <- as.matrix(my.data[1, -1])
year <- str_split(year, "Y", simplify = T)[,1]
state <-as.matrix(my.data[2, -1])
both<-paste(state, year, sep = "_")
mydata1<-my.data[-c(1, 2), ]
colnames(mydata1) <-c("ID", both)
long <-pivot_longer(mydata1,
cols = starts_with("state"),
names_to = "State_year",
values_to = "LR")
long %>%
transmute(
ID, LR,
state = str_split(State_year, "_", simplify = T)[, 1],
state = str_split(state, ":", simplify = T)[, 2],
year = str_split(State_year, "_", simplify = T)[, 2]
)
We get:
ID LR state year
1 C8477 5 FL 2012
2 C8477 5 AZ 2012
3 C8477 7 FL 2011
4 C5273 6 FL 2012
5 C5273 8 AZ 2012
6 C5273 2 FL 2011
7 C5566 8 FL 2012
8 C5566 10 AZ 2012
9 C5566 1 FL 2011
Here is the reshape2
and tidyr
version of achieving this:
library(tidyr)
library(reshape2)
my.data <- data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1)
)
# Combine first two rows as column names
colnames(my.data) <- paste(unlist(my.data[2, ]), unlist(my.data[1, ]), sep = "|")
# Remove first two rows from data
my.data <- my.data[-c(1:2), ] # negative index removes rows
# Melt data
my.data.long <- melt(
my.data,
id.vars = 1L, # would be better to have explicit col name
value.name = "LR"
)
colnames(my.data.long) <- c("ID", "state_year", "LR")
# Split state_year column into two columns:
my.data.long <- separate(
my.data.long,
state_year,
into = c("State", "Year"),
sep = "\\|" # note this is a regex
)
Idea was borrowed here.
A possible base-R approach using reshape
:
## 1) extract the State and Year rows in a separate data.frame
dat.meta <- data.frame(
Year = unlist(dat[1, -1]),
State = sub("State:", "", dat[2, -1]),
LR.id = colnames(dat)[-1]
)
## 2) reshape the data without State and Year rows into long format
dat.long <- reshape(
data = dat[-c(1, 2), ],
varying = 2:4,
direction = "long",
v.names = "LR",
timevar = "LR.id",
times = colnames(dat)[-1]
)
dat.long$id <- NULL ## remove id-column
## 3) merge the reshaped data with the extracted State and Year data
merge(dat.long, dat.meta, by = "LR.id")
#> LR.id ID LR Year State
#> 1 LR C8477 5 2012Y FL
#> 2 LR C5273 6 2012Y FL
#> 3 LR C5566 8 2012Y FL
#> 4 LR.1 C8477 5 2012Y AZ
#> 5 LR.1 C5273 8 2012Y AZ
#> 6 LR.1 C5566 10 2012Y AZ
#> 7 LR.2 C8477 7 2011Y FL
#> 8 LR.2 C5273 2 2011Y FL
#> 9 LR.2 C5566 1 2011Y FL
Data
dat <- data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1),
stringsAsFactors = FALSE
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With