I've been provided with a couple of very awkwardly formatted data in excel which I need to reshape so it fits to run a survival analysis in R. I uploaded an extract of the data to Google drive: https://drive.google.com/open?id=1ret3bCDCYPDALQ16YBloaeopfl2-qVbp The original data frame has about 2100 observations and 950 variables Here is the basic data frame: <pre class="prettyprint"><code>my.data<-data.frame( ID=c( "", "","C8477","C5273","C5566"), LR=c("2012Y","State:FL",5,6,8), LR=c("2012Y","State:AZ",5,8,10), LR=c("2011Y","State:FL",7,2,1) ) my.data # ID LR LR.1 LR.2 # 1 2012Y 2012Y 2011Y # 2 State:FL State:AZ State:FL # 3 C8477 5 5 7 # 4 C5273 6 8 2 # 5 C5566 8 10 1 </code></pre> All the columns have the same name "LR". I don't know if this will be a problem later... The Year is given in row 1 and the according state the observation happened in row 2. As output I need to have some panel data that I work with in later survival analysis. <pre class="prettyprint"><code>my.data<-data.frame( ID=c("C8477","C5273","C5566"), Year=c("2012","2012","2011"), State=c("FL","AZ","FL"),LR=c(5,8,1) ) my.data # ID Year State LR # 1 C8477 2012 FL 5 # 2 C5273 2012 AZ 8 # 3 C5566 2011 FL 1 </code></pre> I played around with the reshape function and seq functions, but non of that will help me move in the right direction, as the data frame is so oddly arranged.

This is a <code>tidyverse</code> approach: <pre class="prettyprint"><code>my.data <- data.frame( ID=c( "", "","C8477","C5273","C5566"), LR=c("2012Y","State:FL",5,6,8), LR=c("2012Y","State:AZ",5,8,10), LR=c("2011Y","State:FL",7,2,1) ) </code></pre> my code: <pre class="prettyprint"><code>library(tidyverse) year <- as.matrix(my.data[1, -1]) year <- str_split(year, "Y", simplify = T)[,1] state <-as.matrix(my.data[2, -1]) both<-paste(state, year, sep = "_") mydata1<-my.data[-c(1, 2), ] colnames(mydata1) <-c("ID", both) long <-pivot_longer(mydata1, cols = starts_with("state"), names_to = "State_year", values_to = "LR") long %>% transmute( ID, LR, state = str_split(State_year, "_", simplify = T)[, 1], state = str_split(state, ":", simplify = T)[, 2], year = str_split(State_year, "_", simplify = T)[, 2] ) </code></pre> We get: <pre class="prettyprint"><code> ID LR state year 1 C8477 5 FL 2012 2 C8477 5 AZ 2012 3 C8477 7 FL 2011 4 C5273 6 FL 2012 5 C5273 8 AZ 2012 6 C5273 2 FL 2011 7 C5566 8 FL 2012 8 C5566 10 AZ 2012 9 C5566 1 FL 2011 </code></pre>

Here is the <code>reshape2</code> and <code>tidyr</code> version of achieving this: <pre class="prettyprint"><code>library(tidyr) library(reshape2) my.data <- data.frame( ID=c( "", "","C8477","C5273","C5566"), LR=c("2012Y","State:FL",5,6,8), LR=c("2012Y","State:AZ",5,8,10), LR=c("2011Y","State:FL",7,2,1) ) # Combine first two rows as column names colnames(my.data) <- paste(unlist(my.data[2, ]), unlist(my.data[1, ]), sep = "|") # Remove first two rows from data my.data <- my.data[-c(1:2), ] # negative index removes rows # Melt data my.data.long <- melt( my.data, id.vars = 1L, # would be better to have explicit col name value.name = "LR" ) colnames(my.data.long) <- c("ID", "state_year", "LR") # Split state_year column into two columns: my.data.long <- separate( my.data.long, state_year, into = c("State", "Year"), sep = "\\|" # note this is a regex ) </code></pre> Idea was borrowed here.

Reshape data in R with fixed effect information within column

Tags:

stack

r

reshape

I've been provided with a couple of very awkwardly formatted data in excel which I need to reshape so it fits to run a survival analysis in R.

I uploaded an extract of the data to Google drive: https://drive.google.com/open?id=1ret3bCDCYPDALQ16YBloaeopfl2-qVbp
The original data frame has about 2100 observations and 950 variables

Here is the basic data frame:

my.data<-data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

my.data

#     ID       LR     LR.1     LR.2
# 1          2012Y    2012Y    2011Y
# 2       State:FL State:AZ State:FL
# 3 C8477        5        5        7
# 4 C5273        6        8        2
# 5 C5566        8       10        1

All the columns have the same name "LR". I don't know if this will be a problem later...

The Year is given in row 1 and the according state the observation happened in row 2.

As output I need to have some panel data that I work with in later survival analysis.

my.data<-data.frame(
  ID=c("C8477","C5273","C5566"),
  Year=c("2012","2012","2011"), 
  State=c("FL","AZ","FL"),LR=c(5,8,1)
) 

my.data

#     ID Year State LR
# 1 C8477 2012    FL  5
# 2 C5273 2012    AZ  8
# 3 C5566 2011    FL  1

I played around with the reshape function and seq functions, but non of that will help me move in the right direction, as the data frame is so oddly arranged.

846

asked Oct 13 '19 19:10

Juan

3 Answers

This is a tidyverse approach:

my.data <- data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

my code:

library(tidyverse)
year <- as.matrix(my.data[1, -1])
year <- str_split(year, "Y", simplify = T)[,1]
state <-as.matrix(my.data[2, -1])
both<-paste(state, year, sep = "_")
mydata1<-my.data[-c(1, 2), ]
colnames(mydata1) <-c("ID", both)
long <-pivot_longer(mydata1, 
             cols = starts_with("state"),
             names_to = "State_year",
             values_to = "LR")
long %>%
  transmute(
    ID, LR, 
    state = str_split(State_year, "_", simplify = T)[, 1],
    state = str_split(state, ":", simplify = T)[, 2], 
    year = str_split(State_year, "_", simplify = T)[, 2]
)

We get:

  ID    LR    state year 
1 C8477 5     FL    2012 
2 C8477 5     AZ    2012 
3 C8477 7     FL    2011 
4 C5273 6     FL    2012 
5 C5273 8     AZ    2012 
6 C5273 2     FL    2011 
7 C5566 8     FL    2012 
8 C5566 10    AZ    2012 
9 C5566 1     FL    2011

192

answered Oct 22 '22 15:10

Zhiqiang Wang

Here is the reshape2 and tidyr version of achieving this:

library(tidyr)
library(reshape2)

my.data <- data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

# Combine first two rows as column names
colnames(my.data) <- paste(unlist(my.data[2, ]), unlist(my.data[1, ]), sep = "|")
# Remove first two rows from data
my.data <- my.data[-c(1:2), ] # negative index removes rows

# Melt data
my.data.long <- melt(
  my.data, 
  id.vars = 1L, # would be better to have explicit col name
  value.name = "LR" 
)
colnames(my.data.long) <- c("ID", "state_year", "LR")

# Split state_year column into two columns:
my.data.long <- separate(
  my.data.long, 
  state_year, 
  into = c("State", "Year"), 
  sep = "\\|" # note this is a regex
)

Idea was borrowed here.

answered Oct 22 '22 16:10

Bulat

A possible base-R approach using reshape:

## 1) extract the State and Year rows in a separate data.frame
dat.meta <- data.frame(
    Year = unlist(dat[1, -1]),
    State = sub("State:", "", dat[2, -1]),
    LR.id = colnames(dat)[-1]
)

## 2) reshape the data without State and Year rows into long format
dat.long <- reshape(
    data = dat[-c(1, 2), ], 
    varying = 2:4,
    direction = "long",
    v.names = "LR",
    timevar = "LR.id",
    times = colnames(dat)[-1]
)

dat.long$id <- NULL  ## remove id-column   

## 3) merge the reshaped data with the extracted State and Year data
merge(dat.long, dat.meta, by = "LR.id")

#>   LR.id    ID LR  Year State
#> 1    LR C8477  5 2012Y    FL
#> 2    LR C5273  6 2012Y    FL
#> 3    LR C5566  8 2012Y    FL
#> 4  LR.1 C8477  5 2012Y    AZ
#> 5  LR.1 C5273  8 2012Y    AZ
#> 6  LR.1 C5566 10 2012Y    AZ
#> 7  LR.2 C8477  7 2011Y    FL
#> 8  LR.2 C5273  2 2011Y    FL
#> 9  LR.2 C5566  1 2011Y    FL

Data

dat <- data.frame(
    ID=c( "", "","C8477","C5273","C5566"),
    LR=c("2012Y","State:FL",5,6,8),
    LR=c("2012Y","State:AZ",5,8,10),
    LR=c("2011Y","State:FL",7,2,1),
    stringsAsFactors = FALSE
)

answered Oct 22 '22 16:10

Joris C.

Related questions
                            
                                R Shiny - multi-page editable DataTable jumps to row #1 after an edit
                            
                                Regular expression in Rstudio's "Find in files"
                            
                                Saving S4 objects in a list of list
                            
                                Install R packages from the command line
                            
                                How to identify all sequential numbers not covered by 'to' and 'from' positions?
                            
                                NoSuchMethodError when using Scala in R with rscala
                            
                                How to combine similar strings showing most common characters
                            
                                How to install R 3.4.4 in alpine
                            
                                Is there a way around casting large integers as string when querying data from BigQuery through R?
                            
                                Fast sorted sample with replacement
                            
                                Not enough space Error appears when running for loop for 13K pdf documents
                            
                                Shiny DT: Freeze rownames while sorting?
                            
                                Rmarkdown - Use table name as variable in dynamic sql chunk?
                            
                                Shiny selectInput depending on validated reactive does not pass on validation error
                            
                                Replacing missing values
                            
                                brain-teaser R problem - programming problem solving in R
                            
                                ggplot lines from point to origin and cosine scores
                            
                                header on first page and others
                            
                                `geom_histogram` and `stat_bin()` don't align
                            
                                Checking if there exists a value in vector of dates that lies within a given range

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With