I am trying to split a vector of strings into a data.frame. For a fixed column order this isn't a problem (e.g. as described here), but in my particular case not every string contains every column of the future data frame. This is what the output should look like for a toy input:
input <- c("an=1;bn=3;cn=45",
"bn=3.5;cn=76",
"an=2;dn=5")
res <- do.something(input)
> res
     an  bn cn dn
[1,]  1   3 45 NA
[2,] NA 3.5 76 NA
[3,]  2  NA NA  5
I am now looking for a function do.something that can do this in an efficient way. My naive solution at the moment would be to loop over the input elements, strsplit them on ";", then strsplit the pieces again on "=", and then fill the data.frame result by result.
Is there a more R-like way to do this? I am afraid that doing it element by element would take quite a long time for a long input vector.
EDIT: Just for completeness, my naive solution looks like this:
do.something <- function(x){
  # Split each string into "key=value" fields, then each field into key and value
  temp <- strsplit(x,";")
  temp2 <- sapply(temp,strsplit,"=")
  ul.temp2 <- unlist(temp2)
  # Keys sit at the odd positions of the flattened vector
  label <- sort(unique(ul.temp2[seq(1,length(ul.temp2),2)]))
  res <- data.frame(matrix(NA, nrow = length(x), ncol = length(label)))
  colnames(res) <- label
  # Fill the result cell by cell
  for(i in seq_along(temp)){
    for(j in seq_along(label)){
      curInfo <- unlist(temp2[[i]])
      if(sum(is.element(curInfo,label[j]))>0){
        # The value directly follows its key
        res[i,j] <- curInfo[which(curInfo==label[j])+1]
      }
    }
  }
  res
}
EDIT2: Unfortunately my large input data looks like this (entries without '=' possible):
input <- c("an=1;bn=3;cn=45",
"an;bn=3.5;cn=76",
"an=2;dn=5")
so I cannot compare the given answers on my problem at hand. My naive solution for this case is:
do.something <- function(x){
  temp <- strsplit(x,";")
  # Column names: the part before "=" in every field
  tempNames <- sort(unique(sapply(strsplit(unlist(temp),"="),"[",1)))
  res <- data.frame(matrix(NA, nrow = length(x), ncol = length(tempNames)))
  colnames(res) <- tempNames
  for(i in seq_along(temp)){
    curSplit <- strsplit(unlist(temp[[i]]),"=")
    curNames <- sapply(curSplit,"[",1)
    # Fields without "=" have no second element, so their value is NA
    curValues <- sapply(curSplit,"[",2)
    for(j in seq_along(tempNames)){
      if(is.element(colnames(res)[j],curNames)){
        res[i,j] <- curValues[curNames==colnames(res)[j]]
      }
    }
  }
  res
}
Here's another way which should work even given your edited data. Extract the column names and values from your input vector using regmatches, then run through each list element, matching the values to the appropriate column names.
# Get column names
tag <- regmatches( input , gregexpr( "[a-z]+" , input ) )
# Get numbers including floating point (any number of decimal digits);
# a ";" directly after a name marks a missing value and is replaced with NA
val <- regmatches( input , gregexpr( "\\d+\\.?\\d*|(?<=[a-z]);" , input , perl = TRUE ) )
val <- lapply( val , gsub , pattern = ";" , replacement = NA )
# Column names
nms <- unique( unlist(tag) )
# Intermediate matrices (SIMPLIFY = FALSE keeps a list even if all rows have the same number of fields)
ll <- mapply( cbind , val , tag , SIMPLIFY = FALSE )
# Match to appropriate columns and coerce to data.frame
out <- data.frame( do.call( rbind , lapply( ll , function(x) x[ match( nms , x[,2] ) , 1 ] ) ) )
names(out) <- nms
#     an   bn   cn   dn
# 1    1    3   45 <NA>
# 2 <NA>  3.5   76 <NA>
# 3    2 <NA> <NA>    5
This is a somewhat bad technique, but sometimes ept (eval parse text) is useful.
> library(plyr)
> rbind.fill(lapply(input, function(x) {l <- new.env(); eval(parse(text = x), envir=l); as.data.frame(as.list(l))}))
  an cn  bn dn
1  1 45 3.0 NA
2 NA 76 3.5 NA
3  2 NA  NA  5
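To make the eval parse text step less opaque, here is a minimal sketch of what it does to a single element. It relies on the fact that ";" separates R statements and "=" is an assignment, so each string happens to be valid R code. The variable names env, vals and bad are only illustrative. The last part shows why this approach presumably cannot handle a bare key without "=", as in the edited input: the key is evaluated as a variable lookup.

```r
# One element of the toy input; R reads it as three assignments
x <- "an=1;bn=3;cn=45"

# Evaluate in a fresh environment so the workspace is not modified
env <- new.env()
eval(parse(text = x), envir = env)

# The environment now holds one variable per key
vals <- mget(c("an", "bn", "cn"), envir = env)
str(vals)

# A bare key such as "an" is evaluated as a variable lookup, which
# raises an error unless an object called "an" happens to exist
bad <- tryCatch(
  eval(parse(text = "an;bn=3.5;cn=76"), envir = new.env()),
  error = function(e) "lookup failed"
)
bad
```

Because the strings are executed as code, this should only be used on input you trust.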
Update
> z <- lapply(strsplit(input, ";"),
+   function(x) {
+     e <- Filter(function(y) length(y)==2, strsplit(x, "="))
+     r <- data.frame(lapply(e, `[`, 2))
+     names(r) <- lapply(e, `[`, 1)
+     r
+   })
> rbind.fill(z)
    an   bn   cn   dn
1    1    3   45 <NA>
2 <NA>  3.5   76 <NA>
3    2 <NA> <NA>    5
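As a further alternative to the nested loop in the question, the same table can be filled in one step with base R matrix indexing. This is only a sketch under two assumptions that the thread does not confirm: all values are numeric, and a key without "=" should become NA. The function name parse_pairs is hypothetical.

```r
input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

# parse_pairs is a hypothetical name, not from the thread
parse_pairs <- function(x) {
  # Split every string into its "key=value" fields
  fields <- strsplit(x, ";", fixed = TRUE)
  # Remember which input element each field came from
  row  <- rep(seq_along(fields), lengths(fields))
  flat <- unlist(fields)
  # Key: everything before the first "=" (the whole field if there is none)
  key <- sub("=.*$", "", flat)
  # Value: everything after the first "=", or NA when "=" is missing
  val <- ifelse(grepl("=", flat, fixed = TRUE),
                sub("^[^=]*=", "", flat),
                NA_character_)
  cols <- sort(unique(key))
  # Assumption: values are numeric, so a numeric matrix is sufficient
  res <- matrix(NA_real_, nrow = length(x), ncol = length(cols),
                dimnames = list(NULL, cols))
  # Fill all cells at once via a two-column (row, column) index matrix
  res[cbind(row, match(key, cols))] <- as.numeric(val)
  as.data.frame(res)
}

res <- parse_pairs(input)
res
```

The work is done by a few vectorized calls over the flattened fields rather than by assigning into the data.frame cell by cell, which is usually the slow part of the loop version.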