
R: splitting a data frame of factors with NAs

Tags:

r

I have a data frame (df) which is imported from the web. I am interested in the following column (colname) of df. The elements of colname are recognized as "factors". A sample from df, which also includes "NA"s, looks like this:

colname
57 +0.10
55
NA
57,5 +2.00
56,5 +0.50
56,5
58

I would like to split colname at the "+" and get 3 numeric columns. The desired output is:

colname1 colname2 total
57.00    0.10     57.10
55.00    0.00     55.00
NA       NA       NA
57.50    2.00     59.50
56.50    0.50     57.00
56.50    0.00     56.50 
58.00    0.00     58.00

which is also a data frame, with all three columns numeric. However, I am stuck on this problem. Whatever I do, I can't get the desired result. The errors are mainly caused by the "NA"s and the "factor" data type. I would be very glad for any help. Thanks a lot.

asked Jan 31 '15 07:01 by oercim



1 Answer

I would first replace the "," with "." using sub (read.table/read.csv have a dec option as well). Then, using cSplit from splitstackshape, split the column in two by specifying the sep as "+". The output will be a data.table. Create the "Total" column by using rowSums. With na.rm=TRUE, a row that is all NA gets a Total of 0; if you want NA for those rows instead, that is possible (one option is shown in the 2nd solution).

df$colname <- sub(',', '.', df$colname)
library(splitstackshape)
dt <- cSplit(df, 'colname', '+')
dt[, Total:=rowSums(.SD,na.rm=TRUE)][]

Or, using base R, split the column ("colname") with strsplit. This assumes the sub step above has already been run, so that the column is character rather than factor. The output will be a "list". Convert each element from "character" to "numeric", pad with NAs so that all list elements have the same length, and rbind them (df2 <- do.call(...)). Create the "Total" column with rowSums, then change it to NA for the rows that are NA in both columns.

 lst <- lapply(strsplit(df$colname, '[+]'), as.numeric)
 df2 <-  do.call(rbind.data.frame, 
     lapply(lst, `length<-`, max(sapply(lst, length))))
 names(df2) <- paste0('colname', 1:2)
 df2$Total <- (NA^!rowSums(!is.na(df2)))*rowSums(df2, na.rm=TRUE)
 df2
 #  colname1 colname2 Total
 #1     57.0      0.1  57.1
 #2     55.0       NA  55.0
 #3       NA       NA    NA
 #4     57.5      2.0  59.5
 #5     56.5      0.5  57.0
 #6     56.5       NA  56.5
 #7     58.0       NA  58.0
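
The two less obvious idioms in the code above can be tried in isolation. `length<-` pads a shorter vector with NA, and since NA^TRUE is NA while NA^FALSE is 1, NA^!n works as a mask that is NA only where n is 0. A minimal sketch with made-up values (pieces, n_present and mask are illustrative names, not part of the original data):

```r
# Padding: assigning a longer length fills the new slots with NA
pieces <- list(c(57, 0.1), 55, NA_real_)
padded <- lapply(pieces, `length<-`, 2)
padded[[2]]                      # 55 NA

# Masking: !n_present is TRUE only where the count of non-NA values is 0
n_present <- c(2, 1, 0)          # non-NA values per row
mask <- NA^!n_present            # 1 1 NA
mask * c(57.1, 55, 0)            # 57.1 55.0 NA
```

Multiplying by the mask leaves real totals untouched and turns the 0 that rowSums(..., na.rm=TRUE) gives for an all-NA row back into NA.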

Or, in this case, eval(parse(...)) could also be used for the total, which avoids the step of changing 0 to NA (the "," must already have been replaced with ".", as above):

 df2$Total <- unname(sapply(df$colname,
                  function(x) eval(parse(text=x))))

Update

If you need to replace the NA with 0 in "colname2":

df2$colname2[with(df2, is.na(colname2) & !is.na(colname1))] <- 0
 df2
 #  colname1 colname2 Total
 #1     57.0      0.1  57.1
 #2     55.0      0.0  55.0
 #3       NA       NA    NA
 #4     57.5      2.0  59.5
 #5     56.5      0.5  57.0
 #6     56.5      0.0  56.5
 #7     58.0      0.0  58.0
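
For reference, here is a self-contained base R sketch that goes from the factor column straight to the output requested in the question. It is a variation on the steps above rather than the exact code of this answer: it picks the first and second piece with vapply instead of padding with `length<-`, and the names x, parts and res are only illustrative.

```r
# Sample values from the question: a factor column with decimal commas and NA
df <- data.frame(colname = factor(c("57 +0.10", "55", NA, "57,5 +2.00",
                                    "56,5 +0.50", "56,5", "58")))

# Factor -> character, then decimal comma -> decimal point
x <- sub(",", ".", as.character(df$colname), fixed = TRUE)

# Split on a literal "+"; an NA input stays a single NA
parts <- strsplit(x, "+", fixed = TRUE)

# First and second piece of each element (NA where there is no such piece)
colname1 <- as.numeric(vapply(parts, `[`, character(1), 1))
colname2 <- as.numeric(vapply(parts, `[`, character(1), 2))

# A missing second piece counts as 0, unless the whole row is NA
colname2[is.na(colname2) & !is.na(colname1)] <- 0

res <- data.frame(colname1, colname2, total = colname1 + colname2)
res
```

Because colname2 is filled with 0 before the sum, a plain `+` is enough here and the all-NA row stays NA in every column.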

data

 df <- structure(list(colname = structure(c(4L, 1L, NA, 5L, 3L, 2L, 
 6L), .Label = c("55", "56,5", "56,5 +0.50", "57 +0.10", "57,5 +2.00", 
"58"), class = "factor")), .Names = "colname", row.names = c(NA, 
 -7L), class = "data.frame")
answered Sep 27 '22 22:09 by akrun
