I've been trying to learn about the different data structures used in popular languages that I have experience with such as lists and dictionaries in Python, associative arrays in PHP (essentially hash tables), vectors in C++, etc.
I have a lot of colleagues that use R religiously and I was wondering how vectors, matrices, and data frames are implemented in R. What are their strengths and weaknesses? I was looking through source code but I couldn't find the data structures themselves. Where in the source code are these definitions located?
To create a matrix in R you need to use the function called matrix(). The arguments to this matrix() are the set of elements in the vector. You have to pass how many numbers of rows and how many numbers of columns you want to have in your matrix. Note: By default, matrices are in column-wise order.
In a data frame the columns contain different types of data, but in a matrix all the elements are the same type of data. A matrix in R is like a mathematical matrix, containing all the same type of thing (usually numbers). R often but not always lets these be used interchangably.
Vectors in R are the same as the arrays in C language which are used to hold multiple data values of the same type. Vectors can also be used to create matrices. Matrices can be created with the help of Vectors by using pre-defined functions in R Programming Language.
How to Create a Data Frame. We can create a dataframe in R by passing the variable a,b,c,d into the data. frame() function. We can R create dataframe and name the columns with name() and simply specify the name of the variables.
As already mentioned, check out the "R internals" manual, as well as this part of "Writing R extensions".
From R Internals, 1.1 SEXPs:
... the basic building blocks of R objects are often called nodes... Both types of node structure have as their first three fields a 32-bit spxinfo header and then three pointers (to the attributes and the previous and next node in a doubly-linked list)
So vectors in R are implemented as a doubly-linked list. And, it even appears that there is no data structure smaller than a single-node linked list. This is evident by:
> a <- 4
> a[1]
4
As mentioned by others: builtin.c
has do_makevector
and do_makelist
, and array.c
has the source for do_matrix
. In addition array.c
contains source for allocMatrix
and memory.c
contains the source for allocVector
.
While a lot of things that were going on were over my head, it seems evident that a matrix is simply a doubly-linked list of doubly-linked lists. I believe (though am unsure) that row and column names (like those stored in a data frame) are stored in the 'attributes' of each list.
The response to the "what the strengths and weaknesses" of the implementation of the data structures would be that (from my limited knowledge) doubly linked lists have a strength in that dynamic memory allocation is simpler and doesn't require the overhead of copying and reallocating an entire array, and the weakness being that (depending on how many pointers there are to the list: head, tail, middle, quarters, etc.) accessing a random value v[99]
can take the overhead of iterating through several elements before the desired one is found.
Is this correct?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With