Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating new columns by splitting a variable into many variables (in R)

I want to create new columns by splitting a vector in a data frame.

I have such a data frame:

YEAR Variable1 Variable2 
2009 000000    00000001
2010 000000    00000001
2011 000000    00000001
2009 000000    00000002
2010 000000    00000002
2009 000000    00000003
...
2009 100000    10000001
2010 100000    10000001
...
2009 100000    10000011
....

As you can see Variable2 is related to Variable 1 (Variable2=Variable1+last two digits: e.g. 01, 02, 03... indicating subcategories). I want to split Variable2 in so many variables as the subcategories are. The result should be:

YEAR Variable1 Variable2 Variable3 Variable4 ... 
2009 000000    00000001  0         0        
2010 000000    00000001  0         0
2011 000000    00000001  0         0
2009 000000    0         00000002  0
2010 000000    0         00000002  0
2009 000000    0         0         00000003
...
2009 100000    10000001  0         0     
2010 100000    10000001  0         0     
...
2009 100000    0         0         0       ...      10000011 

How would you proceed? I thought I should try to recode Variable2 in a loop.. I tried by manipulating strings, but I didn´t solve the problem..

like image 765
woodstock Avatar asked Jun 03 '15 13:06

woodstock


3 Answers

This will work. First let's build the data.

values <- paste0("0000000", 1:4)
library(data.table)
dt <- data.table(val = sample(values, 10, replace = TRUE))

A for loop is enough to define the new columns.

for(level_var in dt[, unique(val)]){
  dt[, eval(level_var) := ifelse(val == level_var, level_var, 0)]
}
like image 101
Michele Usuelli Avatar answered Sep 23 '22 20:09

Michele Usuelli


Here's another suggestion. The code is somewhat longer, but I believe it does the trick, and I hope that it can be understood easily. I'm assuming that the original data is stored in a tab-separated file named 'data.dat'. The output of the code is stored in the matrix 'new_matrix'. The entries are characters, but it shouldn't be a problem to convert them into integers if necessary.

data <- read.table('data.dat', sep='\t', header = TRUE, colClasses = "character")
var2 <- data[3]
nc <- nchar(var2[1,1])
last2 <-substr(var2[,1],nc-1,nc)
subcat <-levels(factor(last2))
mrows <- nrow(data)
mcols <- length(subcat)
varnames <-paste0("Variable",as.character(c(1:(mcols+1))))
new_matrix <- matrix(paste(replicate(nc,"0"),collapse=""),nrow=mrows,ncol=mcols+2)
colnames(new_matrix) <- c("YEAR",varnames)
new_matrix[,1]<-data[,1]
new_matrix[,2]<-data[,2]
for (i in 1:mcols) {
    relevant_rows <- which(last2 == subcat[i])
    new_matrix[relevant_rows,i+2]<-data[relevant_rows,3]
}

Hope this helps.

like image 34
RHertel Avatar answered Sep 21 '22 20:09

RHertel


Using reshape2. A one-line solution. Another line if we'd like to remove the NA values.

library(reshape2)
df <- data.frame(YEAR=c(2009,2010,2011,2009,2010,2009,2009,2010,2009),
                 Var1=c('000000','000000','000000','000000','000000','000000','100000','100000','100000'),
                 Var2=c('0000001','0000001','0000001','0000002','0000002','0000003','1000001','1000001','1000011')) 
df <- dcast(df, YEAR + Var1 + Var2 ~ Var2, value.var = "Var2")[, -3]
df[is.na(df)] <- 0

Result:

  YEAR   Var1 0000001 0000002 0000003 1000001 1000011
1 2009 000000 0000001       0       0       0       0
2 2009 000000       0 0000002       0       0       0
3 2009 000000       0       0 0000003       0       0
4 2009 100000       0       0       0 1000001       0
5 2009 100000       0       0       0       0 1000011
6 2010 000000 0000001       0       0       0       0
7 2010 000000       0 0000002       0       0       0
8 2010 100000       0       0       0 1000001       0
9 2011 000000 0000001       0       0       0       0
like image 26
mpalanco Avatar answered Sep 24 '22 20:09

mpalanco