This is related to "Are there more elegant ways to transform ragged data into a tidy dataframe".
Why is the following code not working?
events = structure(list(date = structure(c(-714974, -714579, -717835), class = "Date"),
days = c(1, 6, 0.5), name = c("Intro to stats", "Stats Winter school",
"TidyR tools"), topics = c("probability|R", "R|regression|ggplot",
"tidyR|dplyr")), .Names = c("date", "days", "name", "topics"
), row.names = c(NA, -3L), class = "data.frame")
> newdf <- data.frame(topic=character(), days=character())
> for(i in 1:length(events$topics)){
+ xx = unlist(strsplit(events$topics[i],'\\|'))
+ for(j in 1:length(xx)){
+ yy = c(xx[j], events$days[i]/length(xx))
+ print(yy)
+ newdf=rbind(newdf, yy)
+ }
+ }
[1] "probability" "0.5"
[1] "R" "0.5"
[1] "R" "2"
[1] "regression" "2"
[1] "ggplot" "2"
[1] "tidyR" "0.25"
[1] "dplyr" "0.25"
There were 11 warnings (use warnings() to see them)
> newdf
X.probability. X.0.5.
1 probability 0.5
2 <NA> 0.5
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
>
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA ... :
invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
3: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
4: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
5: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
6: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
7: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
8: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
9: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
10: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
11: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
>
yy looks fine, but rbind is not working. Where is the error and how can it be corrected? Thanks for your help.
rbind() function in R Language is used to combine specified Vector, Matrix or Data Frame by rows.
When the data frames being combined do not have the same columns, rbind throws an error, whereas dplyr's bind_rows fills the columns missing from one of the data frames with NA.
cbind() and rbind() both create matrices by combining several vectors of the same length. cbind() combines vectors as columns, while rbind() combines them as rows.
The function rbind() is slow, particularly as the data frame gets bigger. You should never use it in a loop. The right way to do it is to initialize the output object at its final size right from the start and then simply fill it in with each turn of the loop.
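The preallocation advice can be sketched for the events data from the question. This is only a minimal illustration, not code from the original posts: the trimmed-down events data frame and the counter k are assumptions made here so that the example runs on its own.

# Assumption: a reduced copy of the question's data (only the columns needed here)
events <- data.frame(
  days = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

# Split once, so the final number of rows is known before the loop starts
parts <- strsplit(events$topics, "|", fixed = TRUE)
n <- lengths(parts)
total <- sum(n)

# Allocate the result at its final size, with the intended column types
newdf <- data.frame(
  topic = character(total),
  daysPerTopic = numeric(total),
  stringsAsFactors = FALSE
)

# Fill the rows in place instead of growing the data frame with rbind()
k <- 1
for (i in seq_along(parts)) {
  for (topic in parts[[i]]) {
    newdf$topic[k] <- topic
    newdf$daysPerTopic[k] <- events$days[i] / n[i]
    k <- k + 1
  }
}
newdf

Because every column is created with its final type, no factor conversion can occur along the way.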
You may try:
newdf <- data.frame(topic=character(), daysPerTopic=numeric(), stringsAsFactors=FALSE)
for(i in seq_along(events$topics)){
  # split the pipe-separated topics of event i
  xx = unlist(strsplit(events$topics[i],'\\|'))
  for(j in seq_along(xx)){
    # one-row data frame: topic stays character, daysPerTopic stays numeric
    yy = data.frame(topic=xx[j], daysPerTopic=events$days[i]/length(xx), stringsAsFactors=FALSE)
    newdf <- rbind(newdf, yy)
  }
}
newdf
# topic daysPerTopic
# 1 probability 0.50
# 2 R 0.50
# 3 R 2.00
# 4 regression 2.00
# 5 ggplot 2.00
# 6 tidyR 0.25
# 7 dplyr 0.25
Or
op <- options(stringsAsFactors=FALSE) #set to FALSE, keeping the old value in op
#Your code
newdf <- data.frame(topic=character(), days=character())
for(i in 1:length(events$topics)){
xx = unlist(strsplit(events$topics[i],'\\|'))
for(j in 1:length(xx)){
yy = c(xx[j], events$days[i]/length(xx))
print(yy)
newdf=rbind(newdf, yy)
}
}
newdf
# X.probability. X.0.5.
# 1 probability 0.5
# 2 R 0.5
# 3 R 2
# 4 regression 2
# 5 ggplot 2
# 6 tidyR 0.25
# 7 dplyr 0.25
options(op) #set back to default
Did you even try to debug your for loop? For example, by adding print(class(yy)) and print(str(newdf)) inside the loop, you would see that the newdf columns are factors, and that after the first iteration they take their only levels from the first yy.
# [1] "probability" "0.5"
# [1] "character"
# 'data.frame': 0 obs. of 2 variables:
# $ topic: Factor w/ 0 levels:
# $ days : Factor w/ 0 levels:
# NULL
# [1] "R" "0.5"
# [1] "character"
# 'data.frame': 1 obs. of 2 variables:
# $ X.probability.: Factor w/ 1 level "probability": 1
# $ X.0.5. : Factor w/ 1 level "0.5": 1
# NULL
# [1] "R" "2"
# [1] "character"
# 'data.frame': 2 obs. of 2 variables:
# $ X.probability.: Factor w/ 1 level "probability": 1 NA
# $ X.0.5. : Factor w/ 1 level "0.5": 1 1
...
You would say "but I defined them as character". True, but if you read the rbind documentation, you will see that
For cbind (rbind), vectors of zero length (including NULL) are ignored unless the result would have zero rows (columns), for S compatibility. (Zero-extent matrices do not occur in S3 and are not ignored in R.)
Another property of rbind is that it inherits its defaults from data.frame, and one of them is stringsAsFactors == TRUE.
What happened here could be easily illustrated in a dummy example, consider
temp <- data.frame(A = letters[1:3])
str(temp)
## 'data.frame': 3 obs. of 1 variable:
## $ A: Factor w/ 3 levels "a","b","c": 1 2 3
temp$A[3] <- "d"
## Warning message:
## In `[<-.factor`(`*tmp*`, 3, value = c(1L, 2L, NA)) :
## invalid factor level, NA generated
temp$A
## [1] a b <NA>
## Levels: a b c
You can see two things here. First, data.frame automatically converted the character class to factor. Second, when you try to assign a value that is not an existing level to a factor vector, it converts it into NA and throws the exact warning you were receiving.
As mentioned by @akrun, setting options(stringsAsFactors=FALSE) will solve your problem.
Set options(stringsAsFactors=FALSE) and your code should work as expected. The warnings and NAs in the result come from the implicit conversion to factors and the resulting type mismatch between the newdf columns and yy; see https://stackoverflow.com/a/1640729/1541036.
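The type mismatch can be seen in isolation with a small sketch. The names yy_vec, yy_df and col_classes are made up for this illustration and do not appear in the question.

# c() coerces mixed inputs to one common type, so the number becomes a string
yy_vec <- c("probability", 1 / 2)
class(yy_vec)

# A one-row data frame keeps a separate type for each column
yy_df <- data.frame(topic = "probability", daysPerTopic = 1 / 2,
                    stringsAsFactors = FALSE)
col_classes <- sapply(yy_df, class)
col_classes

This is why passing a one-row data frame to rbind, rather than a character vector, keeps daysPerTopic numeric in the result.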
For a cleaner way of achieving the same result, here is a group-by solution using data.table:
library(data.table)
events <- as.data.table(events)
events2 <- events[, list(topic=unlist(strsplit(topics, '|', fixed=TRUE))), by=c("date", "days", "name")]
events2[, probability := days / .N, by=name]
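If data.table is not available, roughly the same reshaping can be sketched in base R without an explicit loop. This is an illustrative alternative rather than part of the answer above; the reduced events data frame and the result name tidy are assumptions made for the example.

# Assumption: a reduced copy of the question's data
events <- data.frame(
  name = c("Intro to stats", "Stats Winter school", "TidyR tools"),
  days = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

# One character vector of topics per event, and the number of topics in each
parts <- strsplit(events$topics, "|", fixed = TRUE)
n <- lengths(parts)

# Repeat each event's fields once per topic and share its days equally
tidy <- data.frame(
  name = rep(events$name, times = n),
  topic = unlist(parts),
  daysPerTopic = rep(events$days / n, times = n),
  stringsAsFactors = FALSE
)
tidy

Here rep(..., times = n) plays the role of the by= grouping, and n corresponds to .N in the data.table version.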