I have a data frame with one column which is a list, like so:
> head(movies$genre_list)
[[1]]
[1] "drama" "action" "romance"
[[2]]
[1] "crime" "drama"
[[3]]
[1] "crime" "drama" "mystery"
[[4]]
[1] "thriller" "indie"
[[5]]
[1] "thriller"
[[6]]
[1] "drama" "family"
I want to convert this one column into multiple binary columns, one for each unique element across the lists (in this case, genres). I'm looking for an elegant solution that doesn't involve first finding out how many genres there are, then creating a column for each, and then checking each list element to populate the genre columns. I tried unlist, but it doesn't work on a list column the way I want.
Thanks!
Here are a few approaches, using the following sample data:
movies <- data.frame(genre_list = I(list(
  c("drama", "action", "romance"),
  c("crime", "drama"),
  c("crime", "drama", "mystery"),
  c("thriller", "indie"),
  c("thriller"),
  c("drama", "family"))))
You can use the mtabulate function from "qdapTools" or the unexported charMat function from my "splitstackshape" package.
Syntax would be:
library(qdapTools)
mtabulate(movies$genre_list)
#   action crime drama family indie mystery romance thriller
# 1      1     0     1      0     0       0       1        0
# 2      0     1     1      0     0       0       0        0
# 3      0     1     1      0     0       1       0        0
# 4      0     0     0      0     1       0       0        1
# 5      0     0     0      0     0       0       0        1
# 6      0     0     1      1     0       0       0        0
or
splitstackshape:::charMat(movies$genre_list, fill = 0)
#      action crime drama family indie mystery romance thriller
# [1,]      1     0     1      0     0       0       1        0
# [2,]      0     1     1      0     0       0       0        0
# [3,]      0     1     1      0     0       1       0        0
# [4,]      0     0     0      0     1       0       0        1
# [5,]      0     0     0      0     0       0       0        1
# [6,]      0     0     1      1     0       0       0        0
Improved option 1: Use table somewhat directly:
# Repeat each row index once per genre, then cross-tabulate index against genre
table(rep(seq_len(nrow(movies)), lengths(movies$genre_list)),
      unlist(movies$genre_list, use.names = FALSE))
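Since the question asks for binary columns in the data frame itself, the table result can be converted and attached. What follows is only a sketch under the sample data above: the names tab, genre_cols and movies_wide are made up for illustration, and the factor() wrapper and the "> 0" step are added safeguards (for records with no genres and for a genre repeated within one record), not part of the original approach.

```r
# Sample data from the answer
movies <- data.frame(genre_list = I(list(
  c("drama", "action", "romance"),
  c("crime", "drama"),
  c("crime", "drama", "mystery"),
  c("thriller", "indie"),
  c("thriller"),
  c("drama", "family"))))

n <- nrow(movies)

# Cross-tabulate record index against genre.
# Assumption: wrapping the index in factor() keeps a row for every record,
# even one whose genre vector is empty.
tab <- table(factor(rep(seq_len(n), lengths(movies$genre_list)),
                    levels = seq_len(n)),
             unlist(movies$genre_list, use.names = FALSE))

# Convert the contingency table to a plain data frame of counts,
# then force 0/1 in case a genre appears twice in one record
genre_cols <- as.data.frame.matrix(tab)
genre_cols[] <- lapply(genre_cols, function(col) as.integer(col > 0))

# Attach the indicator columns to the original data
movies_wide <- cbind(movies, genre_cols)
```

After this, movies_wide keeps the original genre_list column and gains one integer column per genre, in alphabetical order.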
Improved option 2: Use a for loop.
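# All distinct genres, in order of first appearance
x <- unique(unlist(movies$genre_list, use.names = FALSE))
# Start from an all-zero matrix with one column per genre
m <- matrix(0, ncol = length(x), nrow = nrow(movies), dimnames = list(NULL, x))
# Set the columns named by each record's genres to 1
for (i in seq_len(nrow(m))) {
  m[i, movies$genre_list[[i]]] <- 1
}
m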
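The same indicator matrix can also be built without an explicit loop by testing each record against the full genre vector with %in%. This is a different technique from the loop above, shown only as a sketch; the names genres and m2 are made up for illustration, and the genres are sorted here so the column order matches the table-based output.

```r
# Sample data from the answer
movies <- data.frame(genre_list = I(list(
  c("drama", "action", "romance"),
  c("crime", "drama"),
  c("crime", "drama", "mystery"),
  c("thriller", "indie"),
  c("thriller"),
  c("drama", "family"))))

# All distinct genres, sorted alphabetically
genres <- sort(unique(unlist(movies$genre_list, use.names = FALSE)))

# For each record, build a 0/1 vector marking which genres it contains,
# then stack those vectors as the rows of a matrix
m2 <- do.call(rbind, lapply(movies$genre_list,
                            function(g) as.integer(genres %in% g)))
colnames(m2) <- genres
m2
```

Because %in% only tests membership, a genre listed twice in one record still yields 1.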
Below is the OLD answer
Convert the list to a list of tables (in turn converted to data.frames):
tables <- lapply(seq_along(movies$genre_list), function(x) {
  temp <- as.data.frame.table(table(movies$genre_list[[x]]))
  names(temp) <- c("Genre", paste("Record", x, sep = "_"))
  temp
})
Use Reduce to merge the resulting list. If I understand your end goal correctly, this results in the transposed form of the result you are interested in.
merged_tables <- Reduce(function(x, y) merge(x, y, all = TRUE), tables)
merged_tables
#      Genre Record_1 Record_2 Record_3 Record_4 Record_5 Record_6
# 1   action        1       NA       NA       NA       NA       NA
# 2    drama        1        1        1       NA       NA        1
# 3  romance        1       NA       NA       NA       NA       NA
# 4    crime       NA        1        1       NA       NA       NA
# 5  mystery       NA       NA        1       NA       NA       NA
# 6    indie       NA       NA       NA        1       NA       NA
# 7 thriller       NA       NA       NA        1        1       NA
# 8   family       NA       NA       NA       NA       NA        1
Transposing and converting NA to 0 is pretty straightforward. Just drop the first column and re-use it as the column names for the new data.frame:
movie_genres <- setNames(data.frame(t(merged_tables[-1])), merged_tables[[1]])
movie_genres[is.na(movie_genres)] <- 0
movie_genres