Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing Delimited Data In a DataFrame Into Separate Columns in R

Tags:

r

I have a data frame which looks as such

A  B  C
1  3  X1=7;X2=8;X3=9
2  4  X1=10;X2=11;X3=12
5  6  X1=13;X2=14

I would like to parse the C column into separate columns as such...

A  B  X1  X2  X3
1  3  7   8   9
2  4  10  11  12
5  6  13  14  NA

How would one go about doing this in R?

like image 692
Craig Avatar asked Jun 10 '14 22:06

Craig


4 Answers

First, here's the sample data in data.frame form

dd<-data.frame(
    A = c(1L, 2L, 5L), 
    B = c(3L, 4L, 6L), 
    C = c("X1=7;X2=8;X3=9", 
    "X1=10;X2=11;X3=12", "X1=13;X2=14"),
    stringsAsFactors=F
)

Now I define a small helper function to take vectors like c("A=1","B=2") and changed them into named vectors like c(A="1", B="2").

namev<-function(x) {
    a<-strsplit(x,"=")
    setNames(sapply(a,'[',2), sapply(a,'[',1))
}

and now I perform the transformations

#turn each row into a named vector
vv<-lapply(strsplit(dd$C,";"), namev)
#find list of all column names
nm<-unique(unlist(sapply(vv, names)))
#extract data from all rows for every column
nv<-do.call(rbind, lapply(vv, '[', nm))
#convert everything to numeric (optional)
class(nv)<-"numeric"
#rejoin with original data
cbind(dd[,-3], nv)

and that gives you

  A B X1 X2 X3
1 1 3  7  8  9
2 2 4 10 11 12
3 5 6 13 14 NA
like image 112
MrFlick Avatar answered Oct 16 '22 08:10

MrFlick


My cSplit function makes solving problems like these fun. Here it is in action:

## Load some packages
library(data.table)
library(devtools) ## Just for source_gist, really
library(reshape2)

## Load `cSplit`
source_gist("https://gist.github.com/mrdwab/11380733")

First, split your values up and create a "long" dataset:

ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")
ddL
#    A B C_1 C_2
# 1: 1 3  X1   7
# 2: 1 3  X2   8
# 3: 1 3  X3   9
# 4: 2 4  X1  10
# 5: 2 4  X2  11
# 6: 2 4  X3  12
# 7: 5 6  X1  13
# 8: 5 6  X2  14

Next, use dcast.data.table (or just dcast) to go from "long" to "wide":

dcast.data.table(ddL, A + B ~ C_1, value.var="C_2")
#    A B X1 X2 X3
# 1: 1 3  7  8  9
# 2: 2 4 10 11 12
# 3: 5 6 13 14 NA
like image 44
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 16 '22 08:10

A5C1D2H2I1M1N2O1R2T1


Here's one possible approach:

dat <- read.table(text="A  B  C
1  3  X1=7;X2=8;X3=9
2  4  X1=10;X2=11;X3=12
5  6  X1=13;X2=14", header=TRUE, stringsAsFactors = FALSE)


library(qdapTools)
dat_C <- strsplit(dat$C, ";")

dat_C2 <- sapply(dat_C, function(x) {
    y <- strsplit(x, "=")
    rep(sapply(y, "[", 1), as.numeric(sapply(y, "[", 2)))
})

data.frame(dat[, -3], mtabulate(dat_C2))

##   A B X1 X2 X3
## 1 1 3  7  8  9
## 2 2 4 10 11 12
## 3 5 6 13 14  0

EDIT To obtain the NA values

m <- mtabulate(dat_C2)
m[m==0] <- NA
data.frame(dat[, -3], m)
like image 30
Tyler Rinker Avatar answered Oct 16 '22 08:10

Tyler Rinker


Here's a nice, somewhat hacky way to get you there.

## read your data
> dat <- read.table(h=T, text = "A  B  C
  1  3  X1=7;X2=8;X3=9
  2  4  X1=10;X2=11;X3=12
  5  6  X1=13;X2=14", stringsAsFactors = FALSE)
## ---
> s <- strsplit(dat$C, ";|=")
> xx <- unique(unlist(s)[grepl('[A-Z]', unlist(s))])
> sap <- t(sapply(seq(s), function(i){
      wh <- which(!xx %in% s[[i]]); n <- suppressWarnings(as.numeric(s[[i]]))
      nn <- n[!is.na(n)]; if(length(wh)){ append(nn, NA, wh-1) } else { nn }
      })) ## see below for explanation
> data.frame(dat[1:2], sap)
#   A B X1 X2 X3
# 1 1 3  7  8  9
# 2 2 4 10 11 12
# 3 5 6 13 14 NA

Basically what's happening in sap is

  1. check which values are missing
  2. change each list element of s to numeric
  3. remove the NA values from (2)
  4. insert NA into the correct position with append
  5. transpose the result
like image 44
Rich Scriven Avatar answered Oct 16 '22 07:10

Rich Scriven