Creating new columns by splitting a variable into many variables (in R)

Question

I want to create new columns by splitting a vector in a data frame.

I have such a data frame:

YEAR Variable1 Variable2 
2009 000000    00000001
2010 000000    00000001
2011 000000    00000001
2009 000000    00000002
2010 000000    00000002
2009 000000    00000003
...
2009 100000    10000001
2010 100000    10000001
...
2009 100000    10000011
....

As you can see Variable2 is related to Variable 1 (Variable2=Variable1+last two digits: e.g. 01, 02, 03... indicating subcategories). I want to split Variable2 in so many variables as the subcategories are. The result should be:

YEAR Variable1 Variable2 Variable3 Variable4 ... 
2009 000000    00000001  0         0        
2010 000000    00000001  0         0
2011 000000    00000001  0         0
2009 000000    0         00000002  0
2010 000000    0         00000002  0
2009 000000    0         0         00000003
...
2009 100000    10000001  0         0     
2010 100000    10000001  0         0     
...
2009 100000    0         0         0       ...      10000011

How would you proceed? I thought I should try to recode Variable2 in a loop.. I tried by manipulating strings, but I didn´t solve the problem..

Michele Usuelli · Accepted Answer

This will work. First let's build the data.

values <- paste0("0000000", 1:4)
library(data.table)
dt <- data.table(val = sample(values, 10, replace = TRUE))

A for loop is enough to define the new columns.

for(level_var in dt[, unique(val)]){
  dt[, eval(level_var) := ifelse(val == level_var, level_var, 0)]
}

RHertel · Answer

Here's another suggestion. The code is somewhat longer, but I believe it does the trick, and I hope that it can be understood easily. I'm assuming that the original data is stored in a tab-separated file named 'data.dat'. The output of the code is stored in the matrix 'new_matrix'. The entries are characters, but it shouldn't be a problem to convert them into integers if necessary.

data <- read.table('data.dat', sep='	', header = TRUE, colClasses = "character")
var2 <- data[3]
nc <- nchar(var2[1,1])
last2 <-substr(var2[,1],nc-1,nc)
subcat <-levels(factor(last2))
mrows <- nrow(data)
mcols <- length(subcat)
varnames <-paste0("Variable",as.character(c(1:(mcols+1))))
new_matrix <- matrix(paste(replicate(nc,"0"),collapse=""),nrow=mrows,ncol=mcols+2)
colnames(new_matrix) <- c("YEAR",varnames)
new_matrix[,1]<-data[,1]
new_matrix[,2]<-data[,2]
for (i in 1:mcols) {
    relevant_rows <- which(last2 == subcat[i])
    new_matrix[relevant_rows,i+2]<-data[relevant_rows,3]
}

Hope this helps.

mpalanco · Answer

Using reshape2. A one-line solution. Another line if we'd like to remove the NA values.

library(reshape2)
df <- data.frame(YEAR=c(2009,2010,2011,2009,2010,2009,2009,2010,2009),
                 Var1=c('000000','000000','000000','000000','000000','000000','100000','100000','100000'),
                 Var2=c('0000001','0000001','0000001','0000002','0000002','0000003','1000001','1000001','1000011')) 
df <- dcast(df, YEAR + Var1 + Var2 ~ Var2, value.var = "Var2")[, -3]
df[is.na(df)] <- 0

Result:

  YEAR   Var1 0000001 0000002 0000003 1000001 1000011
1 2009 000000 0000001       0       0       0       0
2 2009 000000       0 0000002       0       0       0
3 2009 000000       0       0 0000003       0       0
4 2009 100000       0       0       0 1000001       0
5 2009 100000       0       0       0       0 1000011
6 2010 000000 0000001       0       0       0       0
7 2010 000000       0 0000002       0       0       0
8 2010 100000       0       0       0 1000001       0
9 2011 000000 0000001       0       0       0       0

Creating new columns by splitting a variable into many variables (in R)

Tags:

string

split

dataframe

r

woodstock

3 Answers

Michele Usuelli

RHertel

mpalanco

Recent Activity

Donate For Us

Creating new columns by splitting a variable into many variables (in R)

Tags:

string

split

dataframe

r

woodstock

3 Answers

Michele Usuelli

RHertel

mpalanco

Related questions

Recent Activity

Donate For Us