
Values from multiple dataframe columns into one vector




I have a data frame df that has many columns and, say, 100 rows.

How do I take all the level values from the columns with names "alpha", "gamma" and "zeta" and store the 300 of them in a single vector?

Benoît Pointet asked Dec 12 '13 05:12


2 Answers

You have an accepted answer, but here's what I think is happening: You have a combination of factor and character columns. In that case, unlist doesn't work directly, but if they were all factor or if they were all character, there would be no problem:

Some sample data:

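# stringsAsFactors = TRUE reproduces the pre-R 4.0 default, so every column is a factor
mydf <- data.frame(A = LETTERS[1:3], B = LETTERS[4:6], C = LETTERS[7:9],
                   D = LETTERS[10:12], E = LETTERS[13:15],
                   stringsAsFactors = TRUE)
df <- mydf
df$E <- as.character(df$E)  # df now mixes factor and character columns
colsOfInterest <- c("A", "B", "E")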

Case 1, all columns are factors

unlist(mydf[colsOfInterest], use.names = FALSE)
# [1] A B C D E F M N O
# Levels: A B C D E F M N O

Case 2, column E = characters, other columns factors

unlist(df[colsOfInterest], use.names = FALSE)
# [1] "1" "2" "3" "1" "2" "3" "M" "N" "O"

unlist(lapply(df[colsOfInterest], as.character), use.names = FALSE)
# [1] "A" "B" "C" "D" "E" "F" "M" "N" "O"
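To see why Case 2 returns "1" "2" "3" instead of the letters, here is a minimal sketch with a standalone factor. The names f, mixed and fixed are made up for this illustration and are not part of the sample data above.

```r
# A factor stores integer codes plus a table of labels (its levels).
f <- factor(c("A", "B", "C"))
as.integer(f)     # the underlying integer codes
levels(f)         # the labels those codes point to
as.character(f)   # the labels looked up from the codes

# When a factor is mixed with a character vector, unlist() drops the
# levels, so it is the codes that get turned into strings.
mixed <- unlist(list(f, c("M", "N", "O")), use.names = FALSE)

# Converting every piece to character first keeps the labels.
fixed <- unlist(lapply(list(f, c("M", "N", "O")), as.character),
                use.names = FALSE)
```

This is the same reason the lapply(..., as.character) version above gives the expected letters for a mix of factor and character columns.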

For a problem at the scale described here, the benchmarks show that converting to character first and then using unlist (fun2() below) is actually the fastest approach, provided you don't care about retaining factors. Note that the result of fun1() (plain unlist) won't be correct if some columns are factors and some are characters. Here's a benchmark on a 100-row data.frame:

microbenchmark(fun1(), fun2(), fun3())
# Unit: microseconds
#    expr      min        lq    median       uq      max neval
#  fun1()  572.606  587.3595  595.4845  606.175 3439.055   100
#  fun2()  327.570  334.6265  341.2550  350.449 3443.758   100
#  fun3() 1037.020 1055.6215 1064.1745 1086.197 3929.981   100

Of course, here we're talking microseconds, but the results scale too.

For reference, here's what was used for benchmarking. Change "nRow" and "nCol" if you want to test on a different sized data.frame extracting different numbers of columns.

library(microbenchmark)

nRow <- 100
nCol <- 30
# stringsAsFactors = TRUE is needed on R >= 4.0 to get factor columns
mydf <- data.frame(matrix(sample(LETTERS, nRow*nCol, replace = TRUE), nrow = nRow),
                   stringsAsFactors = TRUE)
colsOfInterest <- sample(nCol, sample(nCol*.7, 1))
# [1] 17

fun1 <- function() unlist(mydf[colsOfInterest], use.names = FALSE)                        # plain unlist
fun2 <- function() unlist(lapply(mydf[colsOfInterest], as.character), use.names = FALSE)  # character first
fun3 <- function() as.vector(as.matrix(mydf[colsOfInterest]))                             # via matrix
microbenchmark(fun1(), fun2(), fun3())
A5C1D2H2I1M1N2O1R2T1 answered Oct 26 '22 06:10


I've found that converting to a matrix first makes getting to levels a bit easier.

as.vector(as.matrix(df[,c("alpha", "gamma", "zeta")]))
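As a rough check of the line above against the question's setup, here is a sketch on made-up data. The data frame contents, the extra beta column and the seed are assumptions for illustration only; the real df is not shown in the question.

```r
set.seed(1)  # arbitrary seed, only so the illustration is repeatable

# Hypothetical stand-in for the asker's data: 100 rows of factor columns.
df <- data.frame(alpha = sample(letters, 100, replace = TRUE),
                 beta  = sample(letters, 100, replace = TRUE),
                 gamma = sample(letters, 100, replace = TRUE),
                 zeta  = sample(letters, 100, replace = TRUE),
                 stringsAsFactors = TRUE)

# as.matrix() turns the factor columns into a character matrix;
# as.vector() then flattens it column by column.
vals <- as.vector(as.matrix(df[, c("alpha", "gamma", "zeta")]))

length(vals)        # 3 columns x 100 rows
is.character(vals)  # labels, not integer codes
```

The values come out in column order: all of alpha, then gamma, then zeta.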

Of course, you could have just set stringsAsFactors = FALSE when you read the data in initially.
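A minimal sketch of that option, assuming the data arrives as CSV. The inline csv string and the names df2 and vals2 are hypothetical; with a real file you would pass its path to read.csv instead of text.

```r
# Hypothetical CSV content, supplied inline via `text` instead of a file path.
csv <- "alpha,gamma,zeta\nx,p,u\ny,q,v\nz,r,w"

# With stringsAsFactors = FALSE the string columns stay character.
df2 <- read.csv(text = csv, stringsAsFactors = FALSE)
sapply(df2, class)

# Every column is character, so plain unlist() returns the values directly.
vals2 <- unlist(df2[c("alpha", "gamma", "zeta")], use.names = FALSE)
```

Since R 4.0.0, stringsAsFactors = FALSE is already the default for read.csv and data.frame, so spelling it out mainly matters on older versions.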

Neal Fultz answered Oct 26 '22 07:10