I want to create new columns by splitting a vector in a data frame.
I have such a data frame:
YEAR Variable1 Variable2
2009 000000 00000001
2010 000000 00000001
2011 000000 00000001
2009 000000 00000002
2010 000000 00000002
2009 000000 00000003
...
2009 100000 10000001
2010 100000 10000001
...
2009 100000 10000011
....
As you can see Variable2 is related to Variable 1 (Variable2=Variable1+last two digits: e.g. 01, 02, 03... indicating subcategories). I want to split Variable2 in so many variables as the subcategories are. The result should be:
YEAR Variable1 Variable2 Variable3 Variable4 ...
2009 000000 00000001 0 0
2010 000000 00000001 0 0
2011 000000 00000001 0 0
2009 000000 0 00000002 0
2010 000000 0 00000002 0
2009 000000 0 0 00000003
...
2009 100000 10000001 0 0
2010 100000 10000001 0 0
...
2009 100000 0 0 0 ... 10000011
How would you proceed? I thought I should try to recode Variable2 in a loop.. I tried by manipulating strings, but I didn´t solve the problem..
This will work. First let's build the data.
values <- paste0("0000000", 1:4)
library(data.table)
dt <- data.table(val = sample(values, 10, replace = TRUE))
A for loop is enough to define the new columns.
for(level_var in dt[, unique(val)]){
dt[, eval(level_var) := ifelse(val == level_var, level_var, 0)]
}
Here's another suggestion. The code is somewhat longer, but I believe it does the trick, and I hope that it can be understood easily. I'm assuming that the original data is stored in a tab-separated file named 'data.dat'. The output of the code is stored in the matrix 'new_matrix'. The entries are characters, but it shouldn't be a problem to convert them into integers if necessary.
data <- read.table('data.dat', sep='\t', header = TRUE, colClasses = "character")
var2 <- data[3]
nc <- nchar(var2[1,1])
last2 <-substr(var2[,1],nc-1,nc)
subcat <-levels(factor(last2))
mrows <- nrow(data)
mcols <- length(subcat)
varnames <-paste0("Variable",as.character(c(1:(mcols+1))))
new_matrix <- matrix(paste(replicate(nc,"0"),collapse=""),nrow=mrows,ncol=mcols+2)
colnames(new_matrix) <- c("YEAR",varnames)
new_matrix[,1]<-data[,1]
new_matrix[,2]<-data[,2]
for (i in 1:mcols) {
relevant_rows <- which(last2 == subcat[i])
new_matrix[relevant_rows,i+2]<-data[relevant_rows,3]
}
Hope this helps.
Using reshape2
. A one-line solution. Another line if we'd like to remove the NA values.
library(reshape2)
df <- data.frame(YEAR=c(2009,2010,2011,2009,2010,2009,2009,2010,2009),
Var1=c('000000','000000','000000','000000','000000','000000','100000','100000','100000'),
Var2=c('0000001','0000001','0000001','0000002','0000002','0000003','1000001','1000001','1000011'))
df <- dcast(df, YEAR + Var1 + Var2 ~ Var2, value.var = "Var2")[, -3]
df[is.na(df)] <- 0
Result:
YEAR Var1 0000001 0000002 0000003 1000001 1000011
1 2009 000000 0000001 0 0 0 0
2 2009 000000 0 0000002 0 0 0
3 2009 000000 0 0 0000003 0 0
4 2009 100000 0 0 0 1000001 0
5 2009 100000 0 0 0 0 1000011
6 2010 000000 0000001 0 0 0 0
7 2010 000000 0 0000002 0 0 0
8 2010 100000 0 0 0 1000001 0
9 2011 000000 0000001 0 0 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With