Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should I use a data.frame or a matrix?

Part of the answer is contained already in your question: You use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type.

Consequently, the choice matrix/data.frame is only problematic if you have data of the same type.

The answer depends on what you are going to do with the data in data.frame/matrix. If it is going to be passed to other functions then the expected type of the arguments of these functions determine the choice.

Also:

Matrices are more memory efficient:

m = matrix(1:4, 2, 2)
d = as.data.frame(m)
object.size(m)
# 216 bytes
object.size(d)
# 792 bytes

Matrices are a necessity if you plan to do any linear algebra-type of operations.

Data frames are more convenient if you frequently refer to its columns by name (via the compact $ operator).

Data frames are also IMHO better for reporting (printing) tabular information as you can apply formatting to each column separately.


Something not mentioned by @Michal is that not only is a matrix smaller than the equivalent data frame, using matrices can make your code far more efficient than using data frames, often considerably so. That is one reason why internally, a lot of R functions will coerce to matrices data that are in data frames.

Data frames are often far more convenient; one doesn't always have solely atomic chunks of data lying around.

Note that you can have a character matrix; you don't just have to have numeric data to build a matrix in R.

In converting a data frame to a matrix, note that there is a data.matrix() function, which handles factors appropriately by converting them to numeric values based on the internal levels. Coercing via as.matrix() will result in a character matrix if any of the factor labels is non-numeric. Compare:

> head(as.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a   B  
[1,] "a" "A"
[2,] "b" "B"
[3,] "c" "C"
[4,] "d" "D"
[5,] "e" "E"
[6,] "f" "F"
> head(data.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a B
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6

I nearly always use a data frame for my data analysis tasks as I often have more than just numeric variables. When I code functions for packages, I almost always coerce to matrix and then format the results back out as a data frame. This is because data frames are convenient.


@Michal: Matrices aren't really more memory efficient:

m <- matrix(1:400000, 200000, 2)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 1600776 bytes

... unless you have a large number of columns:

m <- matrix(1:400000, 2, 200000)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 22400568 bytes

The matrix is actually a vector with additional methods. while data.frame is a list. The difference is down to vector vs list. for computation efficiency, stick with matrix. Using data.frame if you have to.


I cannot stress out more the efficiency difference between the two! While it is true that DFs are more convenient in some especially data analysis cases, they also allow heterogeneous data, and some libraries accept them only, these all is really secondary unless you write a one-time code for a specific task.

Let me give you an example. There was a function that would calculate the 2D path of the MCMC method. Basically, this means we take an initial point (x,y), and iterate a certain algorithm to find a new point (x,y) at each step, constructing this way the whole path. The algorithm involves calculating a quite complex function and the generation of some random variable at each iteration, so when it run for 12 seconds I thought it is fine given how much stuff it does at each step. That being said, the function collected all points in the constructed path together with the value of an objective function in a 3-column data.frame. So, 3 columns is not that large, and the number of steps was also more than reasonable 10,000 (in this kind of problems paths of length 1,000,000 are typical, so 10,000 is nothing). So, I thought a DF 10,000x3 is definitely not an issue. The reason a DF was used is simple. After calling the function, ggplot() was called to draw the resulting (x,y)-path. And ggplot() does not accept a matrix.

Then, at some point out of curiosity I decided to change the function to collect the path in a matrix. Gladly the syntax of DFs and matrices is similar, all I did was to change the line specifying df as a data.frame to one initializing it as a matrix. Here I need also to mention that in the initial code the DF was initialized to have the final size, so later in the code of the function only new values were recorded into already allocated spaces, and there was no overhead of adding new rows to the DF. This makes the comparison even more fair, and it also made my job simpler as I did not need to rewrite anything further in the function. Just one line change from the initial allocation of a data.frame of the required size to a matrix of the same size. To adapt the new version of the function to ggplot(), I converted the now returned matrix to a data.frame to use in ggplot().

After I rerun the code I could not believe the result. The code run in a fraction of a second! Instead of about 12 seconds. And again, the function during the 10,000 iterations only read and wrote values to already allocated spaces in a DF (and now in a matrix). And this difference is also for the reasonable (or rather small) size 10000x3.

So, if your only reason to use a DF is to make it compatible with a library function such as ggplot(), you can always convert it to a DF at the last moment -- work with matrices as far as you feel convenient. If on the other hand there is a more substantial reason to use a DF, such as you use some data analysis package that would require otherwise constant transforming from matrices to DFs and back, or you do not do any intensive calculations yourself and only use standard packages (many of them actually internally transform a DF to a matrix, do their job, and then transform the result back -- so they do all efficiency work for you), or do a one-time job so you do not care and feel more comfortable with DFs, then you should not worry about efficiency.

Or a different more practical rule: if you have a question such as in the OP, use matrices, so you would use DFs only when you do not have such a question (because you already know you have to use DFs, or because you do not really care as the code is one-time etc.).

But in general keep this efficiency point always in mind as a priority.