
Why rbind throws a warning

Tags:

r

This is related to the question "Are there more elegant ways to transform ragged data into a tidy dataframe".

Why is the following code not working?

events = structure(list(date = structure(c(-714974, -714579, -717835), class = "Date"), 
    days = c(1, 6, 0.5), name = c("Intro to stats", "Stats Winter school", 
    "TidyR tools"), topics = c("probability|R", "R|regression|ggplot", 
    "tidyR|dplyr")), .Names = c("date", "days", "name", "topics"
), row.names = c(NA, -3L), class = "data.frame")

> newdf <- data.frame(topic=character(), days=character())
> for(i in 1:length(events$topics)){
+ xx = unlist(strsplit(events$topics[i],'\\|'))
+ for(j in 1:length(xx)){
+ yy = c(xx[j], events$days[i]/length(xx))
+ print(yy)
+ newdf=rbind(newdf, yy)
+ }
+ }
[1] "probability" "0.5"        
[1] "R"   "0.5"
[1] "R" "2"
[1] "regression" "2"         
[1] "ggplot" "2"     
[1] "tidyR" "0.25" 
[1] "dplyr" "0.25" 
There were 11 warnings (use warnings() to see them)
> newdf
  X.probability. X.0.5.
1    probability    0.5
2           <NA>    0.5
3           <NA>   <NA>
4           <NA>   <NA>
5           <NA>   <NA>
6           <NA>   <NA>
7           <NA>   <NA>
> 
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA ... :
  invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
3: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
4: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
5: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
6: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
7: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
8: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
9: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
10: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
11: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
> 

yy is okay but rbind is not working. Where is the error and how can it be corrected? Thanks for your help.

rnso asked Aug 03 '14 08:08



3 Answers

You may try:

newdf <- data.frame(topic=character(), daysPerTopic=numeric(), stringsAsFactors=FALSE)
for (i in seq_along(events$topics)) {
  xx <- unlist(strsplit(events$topics[i], '\\|'))
  for (j in seq_along(xx)) {
    # build a one-row data frame so column names and types match newdf
    yy <- data.frame(topic=xx[j], daysPerTopic=events$days[i]/length(xx), stringsAsFactors=FALSE)
    newdf <- rbind(newdf, yy)
  }
}

 newdf
#        topic daysPerTopic
# 1 probability         0.50
# 2           R         0.50
# 3           R         2.00
# 4  regression         2.00
# 5      ggplot         2.00
# 6       tidyR         0.25
# 7       dplyr         0.25

Or

op <- options(stringsAsFactors=FALSE)  # save the old setting and switch factor conversion off

# Your code
newdf <- data.frame(topic=character(), days=character())
for(i in 1:length(events$topics)){
  xx = unlist(strsplit(events$topics[i],'\\|'))
  for(j in 1:length(xx)){
    yy = c(xx[j], events$days[i]/length(xx))
    print(yy)
    newdf=rbind(newdf, yy)
  }
}

 newdf
#  X.probability. X.0.5.
# 1    probability    0.5
# 2              R    0.5
# 3              R      2
# 4     regression      2
# 5         ggplot      2
# 6          tidyR   0.25
# 7          dplyr   0.25

options(op)  # set back to the previous value

akrun answered Oct 26 '22 09:10


Did you try to debug your for loop? For example, by adding print(class(yy)) and print(str(newdf)) inside it, you would see that both columns of newdf are factors, and that after the first iteration their only levels are "probability" and "0.5".

# [1] "probability" "0.5"
# [1] "character"
# 'data.frame': 0 obs. of  2 variables:
#  $ topic: Factor w/ 0 levels: 
#  $ days : Factor w/ 0 levels: 
# NULL
# [1] "R"   "0.5"
# [1] "character"
# 'data.frame': 1 obs. of  2 variables:
#  $ X.probability.: Factor w/ 1 level "probability": 1
#  $ X.0.5.        : Factor w/ 1 level "0.5": 1
# NULL
# [1] "R" "2"
# [1] "character"
# 'data.frame': 2 obs. of  2 variables:
#  $ X.probability.: Factor w/ 1 level "probability": 1 NA
#  $ X.0.5.        : Factor w/ 1 level "0.5": 1 1

...

You might say "but I defined them as character". True, but if you read the rbind documentation, you will see that

For cbind (rbind), vectors of zero length (including NULL) are ignored unless the result would have zero rows (columns), for S compatibility. (Zero-extent matrices do not occur in S3 and are not ignored in R.)

Another property of rbind is that it inherits its defaults from data.frame, one of which is stringsAsFactors = TRUE (the default before R 4.0.0).

What happened here can be illustrated with a dummy example. Consider:

temp <- data.frame(A = letters[1:3])
str(temp)
## 'data.frame':    3 obs. of  1 variable:
## $ A: Factor w/ 3 levels "a","b","c": 1 2 3

temp$A[3] <- "d"
## Warning message:
## In `[<-.factor`(`*tmp*`, 3, value = c(1L, 2L, NA)) :
##   invalid factor level, NA generated

temp$A
## [1] a    b    <NA>
## Levels: a b c

You can see two things here:

  • data.frame automatically converted the character vector to a factor
  • When you try to assign a value that is not an existing level to a factor vector, R converts it to NA and throws the exact warning you were receiving
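
The dummy example above can be extended into a minimal sketch of two ways to avoid the warning: converting the factor column to character before assigning, or adding the new level first. stringsAsFactors = TRUE is passed explicitly here (an assumption, made so the factor behaviour is reproduced whatever the default of your R version is), and the names temp and f are only illustrative.

```r
# Recreate the factor column from the dummy example
temp <- data.frame(A = letters[1:3], stringsAsFactors = TRUE)

# Option 1: turn the factor into a character vector, then assign freely
temp$A <- as.character(temp$A)
temp$A[3] <- "d"

# Option 2: keep a factor, but register the new level before assigning
f <- factor(letters[1:3])
levels(f) <- c(levels(f), "d")
f[3] <- "d"
```

Neither assignment should produce the "invalid factor level" warning, because in both cases "d" is a legal value by the time it is assigned.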

As mentioned by @akrun, setting options(stringsAsFactors=FALSE) will solve your problem.

David Arenburg answered Oct 26 '22 09:10


Set options(stringsAsFactors=FALSE) and your code should work as expected. The warnings and NAs in the result are caused by the implicit conversion to factors and the resulting type mismatch between the newdf columns and yy, see https://stackoverflow.com/a/1640729/1541036.

For a cleaner way of achieving the same result, here's a group-by solution using data.table:

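library(data.table)
events <- as.data.table(events)
# one row per topic: split the pipe-separated topics within each event
events2 <- events[, list(topic=unlist(strsplit(topics, '|', fixed=TRUE))), by=c("date", "days", "name")]
# share each event's days equally among its topics (.N = number of topics per event)
events2[, probability := days / .N, by=name]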
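
As a rough base R counterpart to the grouped approach above, the loop and rbind can be avoided altogether by splitting every topics string at once and repeating each event's per-topic share. This is only a sketch, not part of the answer: the events data frame below is a cut-down copy of the question's data (days and topics only), and the column name daysPerTopic is borrowed from the first answer.

```r
# Cut-down copy of the question's data (assumption: only these two columns matter)
events <- data.frame(
  days   = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

# Split all topic strings in one call; fixed = TRUE treats "|" literally
topic_list <- strsplit(events$topics, "|", fixed = TRUE)
n_topics   <- lengths(topic_list)  # number of topics per event

# One row per topic, with each event's days divided equally among its topics
newdf <- data.frame(
  topic        = unlist(topic_list),
  daysPerTopic = rep(events$days / n_topics, times = n_topics),
  stringsAsFactors = FALSE
)
newdf
```

Because the data frame is built once from complete vectors, no factor levels are fixed before all values are known, which is what triggered the warnings in the original loop.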
ytsaig answered Oct 26 '22 10:10