I have a data frame df that has many columns and, say, 100 rows. How do I take all the values (factor levels) from the columns named "alpha", "gamma" and "zeta" and store all 300 of them in a single vector?
You have an accepted answer, but here's what I think is happening: you have a combination of factor and character columns. In that case, unlist doesn't work directly, but if the columns were all factor or all character, there would be no problem:
Some sample data:
mydf <- data.frame(A = LETTERS[1:3], B = LETTERS[4:6], C = LETTERS[7:9],
                   D = LETTERS[10:12], E = LETTERS[13:15])
# (all factor columns, assuming the pre-R 4.0 default stringsAsFactors = TRUE)
df <- mydf
df$E <- as.character(df$E)  # now a mix of factor and character columns
colsOfInterest <- c("A", "B", "E")

# All factor columns: unlist returns a single factor with the combined levels
unlist(mydf[colsOfInterest], use.names = FALSE)
# [1] A B C D E F M N O
# Levels: A B C D E F M N O

# Mixed factor/character columns: the factors are coerced to their integer
# codes, so the result is wrong
unlist(df[colsOfInterest], use.names = FALSE)
# [1] "1" "2" "3" "1" "2" "3" "M" "N" "O"

# Converting everything to character first gives the correct values
unlist(lapply(df[colsOfInterest], as.character), use.names = FALSE)
# [1] "A" "B" "C" "D" "E" "F" "M" "N" "O"
For a problem at the scale described here, the benchmarks show that converting to character first and using unlist (fun2() below) is actually the fastest approach if you don't care about retaining factors. Note that the result of fun1() won't be correct if some columns are factors and some are characters. Here's a benchmark on a 100-row data.frame:
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3())
# Unit: microseconds
# expr min lq median uq max neval
# fun1() 572.606 587.3595 595.4845 606.175 3439.055 100
# fun2() 327.570 334.6265 341.2550 350.449 3443.758 100
# fun3() 1037.020 1055.6215 1064.1745 1086.197 3929.981 100
Of course, we're talking microseconds here, but the relative ranking holds as the data scales up too.
For reference, here's what was used for benchmarking. Change nRow and nCol if you want to test a different sized data.frame or extract a different number of columns.
nRow <- 100
nCol <- 30
set.seed(1)
# A nRow-by-nCol data.frame of random letters
mydf <- data.frame(matrix(sample(LETTERS, nRow*nCol, replace = TRUE), nrow = nRow))
# Pick a random subset (up to 70%) of the columns to extract
colsOfInterest <- sample(nCol, sample(nCol*.7, 1))
length(colsOfInterest)
# [1] 17

library(microbenchmark)
# fun1: plain unlist; fun2: convert to character first; fun3: go via a matrix
fun1 <- function() unlist(mydf[colsOfInterest], use.names = FALSE)
fun2 <- function() unlist(lapply(mydf[colsOfInterest], as.character), use.names = FALSE)
fun3 <- function() as.vector(as.matrix(mydf[colsOfInterest]))
microbenchmark(fun1(), fun2(), fun3())
I've found that converting to a matrix first makes getting at the level values a bit easier:
as.vector(as.matrix(df[,c("alpha", "gamma", "zeta")]))
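As a quick check using the mydf sample data from the other answer (its columns are named A through E rather than "alpha", "gamma" and "zeta"): as.matrix coerces the factor columns to character, so the level values come through intact.

as.vector(as.matrix(mydf[, c("A", "B", "E")]))
# [1] "A" "B" "C" "D" "E" "F" "M" "N" "O"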
Of course, you could have just set stringsAsFactors = FALSE when you read the data in initially.
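For example, assuming the data came from a CSV file (the file name here is just a placeholder):

df <- read.csv("yourdata.csv", stringsAsFactors = FALSE)
# All columns are character, so unlist works directly
unlist(df[c("alpha", "gamma", "zeta")], use.names = FALSE)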