strsplit into data.frame with incomplete input

Question

I am trying to split a vector of strings into a data.frame. For a fixed column order this isn't a problem (e.g. as described here), but in my particular case not every string contains all the columns of the future data.frame. This is what the output should look like for a toy input:

input <- c("an=1;bn=3;cn=45",
           "bn=3.5;cn=76",
           "an=2;dn=5")

res <- do.something(input)

> res
      an  bn  cn  dn
[1,]  1   3   45  NA
[2,]  NA  3.5 76  NA
[3,]  2   NA  NA  5

I am now looking for a function do.something that can do this efficiently. My naive solution at the moment would be to loop over the input, strsplit each element on ; and then again on =, and fill the data.frame result by result. Is there a more R-like way to do this? I am afraid that going element by element would take quite a long time for a long input vector.

EDIT: Just for completeness, my naive solution looks like this:

  do.something <- function(x){
    temp <- strsplit(x,";")
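    temp2 <- lapply(temp,strsplit,"=")  # lapply always returns a list; sapply may simplify when all strings have the same number of fields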
    ul.temp2 <- unlist(temp2)
    label <- sort(unique(ul.temp2[seq(1,length(ul.temp2),2)]))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(label)))
    colnames(res) <- label
    for(i in 1:length(temp)){
      for(j in 1:length(label)){
        curInfo <- unlist(temp2[[i]])
        if(sum(is.element(curInfo,label[j]))>0){
          res[i,j] <- curInfo[which(curInfo==label[j])+1]
        }
      }
    }
    res
  }

EDIT2: Unfortunately my large input data looks like this (entries without '=' possible):

input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

so I cannot test the given answers on my actual data. My naive solution for this case is:

do.something <- function(x){
    temp <- strsplit(x,";")
    tempNames <- sort(unique(sapply(strsplit(unlist(temp),"="),"[",1)))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(tempNames)))
    colnames(res) <- tempNames

    for(i in 1:length(temp)){
      curSplit <- strsplit(unlist(temp[[i]]),"=")
      curNames <- sapply(curSplit,"[",1)
      curValues <- sapply(curSplit,"[",2)
      for(j in 1:length(tempNames)){
        if(is.element(colnames(res)[j],curNames)){
          res[i,j] <- curValues[curNames==colnames(res)[j]]
        }
      }
    }
    res
  }

Simon O'Hanlon · Accepted Answer

Here's another way which should work even given your edited data. Extract the column names and values from your input vector using regmatches, then run through each list element matching the values to the appropriate column names.

#  Get column names
tag <- regmatches( input , gregexpr( "[a-z]+" , input ) )

#  Get numbers including floating point, replace missing values with NA
#  (backslashes must be doubled inside an R string; \\d* allows more than one decimal digit)
val <- regmatches( input , gregexpr( "\\d+\\.?\\d*|(?<=[a-z]);" , input , perl = TRUE ) )
val <- lapply( val , gsub , pattern = ";" , replacement = NA )

#  Column names
nms <- unique( unlist(tag) )

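#  Intermediate matrices (SIMPLIFY = FALSE keeps a list even when all inputs have the same number of fields)
ll <- mapply( cbind , val , tag , SIMPLIFY = FALSE )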

#  Match to appropriate columns and coerce to data.frame
out <- data.frame( do.call( rbind , lapply( ll , function(x) x[ match( nms , x[,2] ) ]  ) ) )
names(out) <- nms
#    an   bn   cn   dn
#1    1    3   45 <NA>
#2 <NA>  3.5   76 <NA>
#3    2 <NA> <NA>    5
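
Note that the columns of out are character (or factor, depending on the stringsAsFactors default of the R version), not numeric as in the desired output. A minimal sketch of the conversion, assuming every non-missing value is a plain number; the data.frame is rebuilt by hand here only so that the snippet runs on its own:

```r
# Hand-built copy of the character result shown above (an assumption for this sketch)
out <- data.frame(an = c("1", NA, "2"),
                  bn = c("3", "3.5", NA),
                  cn = c("45", "76", NA),
                  dn = c(NA, NA, "5"),
                  stringsAsFactors = FALSE)

# as.character() first, so the conversion is also safe if the columns are factors
out[] <- lapply(out, function(col) as.numeric(as.character(col)))
str(out)
```

Assigning into out[] keeps the data.frame structure and column names while replacing every column.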

kohske · Answer

This is somewhat bad practice, but sometimes ept (eval(parse(text = ...))) is useful.

> library(plyr)
> rbind.fill(lapply(input, function(x) {l <- new.env(); eval(parse(text = x), envir=l); as.data.frame(as.list(l))}))
  an cn  bn dn
1  1 45 3.0 NA
2 NA 76 3.5 NA
3  2 NA  NA  5
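
This works because each input string happens to be valid R code: = assigns and ; separates statements. A small sketch of what happens for a single string (the names e and vals are only for illustration). Since eval runs whatever the string contains, this is only reasonable for trusted input, and it will typically fail for entries without '=' such as "an;bn=3.5", because an is then looked up as a variable.

```r
# Evaluate one input string as R code inside a fresh environment
e <- new.env()
eval(parse(text = "an=1;bn=3;cn=45"), envir = e)  # runs an=1, bn=3, cn=45 inside e

# The assigned variables are the key/value pairs of that string
vals <- mget(c("an", "bn", "cn"), envir = e)
str(vals)
```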

Update (for the edited input; entries without '=' are dropped before binding):

> z <- lapply(strsplit(input, ";"), 
+             function(x) {
+               e <- Filter(function(y) length(y)==2, strsplit(x, "="))
+               r <- data.frame(lapply(e, `[`, 2))
+               names(r) <- lapply(e, `[`, 1)
+               r
+             })
> rbind.fill(z)
    an   bn   cn   dn
1    1    3   45 <NA>
2 <NA>  3.5   76 <NA>
3    2 <NA> <NA>    5
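
For comparison, the same idea (split on ;, then on =, and treat entries without = as missing) can be sketched in base R without plyr by filling a numeric matrix through row/column index pairs. parse_pairs is a hypothetical helper name, and the sketch assumes that all values are numeric:

```r
input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

# Hypothetical helper: parse "key=value;key=value" strings into a numeric data.frame
parse_pairs <- function(x) {
  fields <- strsplit(x, ";", fixed = TRUE)
  # Remember which input string each field came from
  row_id <- rep(seq_along(fields), lengths(fields))
  flat   <- unlist(fields)
  # Text before the first "=" is the key (the whole field if there is no "=")
  key    <- sub("=.*$", "", flat)
  # Text after the first "=" is the value; fields without "=" become NA
  value  <- ifelse(grepl("=", flat, fixed = TRUE),
                   sub("^[^=]*=", "", flat),
                   NA_character_)
  cols <- sort(unique(key))
  res  <- matrix(NA_real_, nrow = length(x), ncol = length(cols),
                 dimnames = list(NULL, cols))
  # A two-column index matrix fills all (row, column) cells in one assignment
  res[cbind(row_id, match(key, cols))] <- as.numeric(value)
  as.data.frame(res)
}

res <- parse_pairs(input)
print(res)
```

There is no loop over rows or columns here, so the work is done by vectorised calls on the flattened fields. Non-numeric values would become NA with a warning from as.numeric.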

Tags: string, dataframe, r

Daniel Fischer
