I just started using R, and came across data.table. I found it brilliant.
A very naive question: can I ignore data.frame and use only data.table, to avoid confusing the syntax of the two?
A data frame is used for storing data tables. It is a list of vectors of equal length. For example, a variable df could be a data frame containing three vectors n, s and b.
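A minimal sketch of such a data frame, using only base R. The values placed in n, s and b are made up for illustration; only the names come from the text above.

```r
# Illustrative values only (assumed, not taken from the original text).
n <- c(2, 3, 5)            # numeric vector
s <- c("aa", "bb", "cc")   # character vector
b <- c(TRUE, FALSE, TRUE)  # logical vector

# A data frame is a list of equal-length vectors, one per column.
df <- data.frame(n, s, b)

str(df)       # lists the three columns and their types
is.list(df)   # TRUE: a data frame is built on a list
nrow(df)      # 3: one row per element of each vector
```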
Method 1: Using setDT(). The setDT() function coerces a data.frame or a list into a data.table. The conversion is made by reference, so the original object itself is modified rather than copied.
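A short sketch of the by-reference conversion, assuming the data.table package is installed. The column names id and value are hypothetical; as.data.table() is shown only as the copying counterpart for contrast.

```r
library(data.table)

# Hypothetical example data.
df <- data.frame(id = 1:3, value = c(10, 20, 30))
class(df)    # "data.frame"

# setDT() converts df in place: no copy is made and no assignment is needed.
setDT(df)
class(df)    # "data.table" "data.frame"

# as.data.table() returns a new object and leaves its input unchanged.
df2 <- data.frame(id = 1:3, value = c(10, 20, 30))
dt2 <- as.data.table(df2)
class(df2)   # still "data.frame"
```

setDT() is usually the better fit for large objects, since it avoids duplicating the data in memory.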
From the data.table FAQ
As FAQ 1.1 highlights,
j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
"unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c."
A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
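The inheritance described in the FAQ can be checked directly. This is a small sketch with made-up numbers; lm() and summary() simply stand in for any code written with data.frame in mind.

```r
library(data.table)

# Made-up values, for illustration only.
DT <- data.table(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))

class(DT)                    # "data.table" "data.frame"
is.data.frame(DT)            # TRUE
inherits(DT, "data.frame")   # TRUE

# Base R functions that expect a data.frame accept DT unchanged.
fit <- lm(y ~ x, data = DT)
coef(fit)
summary(DT)
```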
data.frame and data.table
- DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
- DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
- For this reason we say the comma is optional in DT, but not optional in DF
- DT[[3]] == DF[, 3] == DF[[3]]
- DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
- DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
- DT[ , "colA"][[1]] == DF[ , "colA"].
- DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
- DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
- DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
- DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA
- DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
- data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
- check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
- stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
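A few of the differences listed above can be tried side by side. The sketch below assumes a reasonably recent data.table (1.9.8 or later) and uses a hypothetical two-column table, so it indexes position 2 rather than 3.

```r
library(data.table)

# Hypothetical data with one NA row, to show the NA handling.
DF <- data.frame(colA = c(1, 2, NA), colB = c(1, 5, NA))
DT <- as.data.table(DF)

# A single index means "row" in data.table but "column" in data.frame.
r2 <- DT[2]    # 2nd row, as a one-row data.table
c2 <- DF[2]    # 2nd column, as a one-column data.frame

# Selecting one column by position.
v1 <- DF[, 1]  # plain vector, because drop = TRUE is the default
d1 <- DT[, 1]  # one-column data.table

# [[ ]] extracts the column vector in both cases.
identical(DT[[1]], DF[[1]])        # TRUE

# NA in a logical row filter: treated as FALSE by data.table,
# but produces an all-NA row in data.frame.
nrow(DT[colA == colB])             # 1
nrow(DF[DF$colA == DF$colB, ])     # 2 (one real row plus one NA row)
```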
There may be cases where a package uses code that fails when given a data.table. However, since data.table is actively maintained to avoid such problems, any that do arise are likely to be fixed promptly.
For example, see this question and prompt response.
From the NEWS for v 1.8.2
- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.