I just started using R and came across data.table. I found it brilliant. A very naive question: can I ignore data.frame and use only data.table, to avoid confusing the syntax of the two?


What can you do with a data.frame that you can't with a data.table?

1 Answer

From the data.table FAQ

FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?

As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
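The inheritance described above can be checked directly. This is a minimal sketch: the table contents and the helper `col_means()` are made up for illustration, with the helper standing in for code that only knows about data.frame.

```r
library(data.table)

DT <- data.table(id = 1:3, value = c(2.5, 3.5, 4.5))

# A data.table carries both classes, so data.frame checks succeed.
class(DT)
is.data.frame(DT)

# Hypothetical helper written with only data.frame in mind.
col_means <- function(df) {
  stopifnot(is.data.frame(df))
  vapply(df, mean, numeric(1))
}

# The data.table is accepted without any conversion.
means <- col_means(DT)
means
```

Because the class vector contains `"data.frame"`, S3 dispatch falls back to data.frame methods wherever no data.table method exists.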

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0 :

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.

What are the smaller syntax differences between data.frame and data.table?

DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column

DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)

For this reason we say the comma is optional in DT, but not optional in DF
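As a small illustration of the single-index difference, the sketch below builds the same (made-up) data as a data.frame and as a data.table and compares the shape of `DF[3]` and `DT[3]`.

```r
library(data.table)

DF <- data.frame(x = 1:4, y = letters[1:4], z = c(10, 20, 30, 40))
DT <- as.data.table(DF)

df_one_index <- DF[3]   # third COLUMN, returned as a one-column data.frame
dt_one_index <- DT[3]   # third ROW, returned as a one-row data.table

dim(df_one_index)
dim(dt_one_index)

# In data.table the comma is optional for a row subset.
same_row <- isTRUE(all.equal(DT[3], DT[3, ]))
same_row
```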

DT[[3]] == DF[, 3] == DF[[3]]

DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.

DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default

DT[ , "colA"][[1]] == DF[ , "colA"].

DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)

DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
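The column-selection equivalences above can be tried side by side. This sketch assumes a recent data.table (v1.9.8 or later, where `DT[, "colA"]` returns a one-column data.table); the data are invented for the example.

```r
library(data.table)

DF <- data.frame(colA = c(1.5, 2.5, 3.5), colB = c("x", "y", "z"))
DT <- as.data.table(DF)

# Forms that give a plain vector
v1 <- DF[, "colA"]     # data.frame drops to a vector by default
v2 <- DF[["colA"]]
v3 <- DT[["colA"]]
v4 <- DT[, colA]       # unquoted column name evaluated in j

# Forms that give a one-column table
t1 <- DF[, "colA", drop = FALSE]   # one-column data.frame
t2 <- DT[, list(colA)]             # one-column data.table
t3 <- DT[, "colA"]                 # one-column data.table (assumes v1.9.8+)

# Extracting the column from the one-column table recovers the vector.
identical(t3[[1]], v1)
```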

DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.

DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA

DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
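The NA handling in row subsets can be sketched as follows, using made-up data. Two assumptions to note: the data.frame calls are written with an explicit comma and `DF$` prefixes so that they run as row subsets, and `DF[NA, ]` is used as the row-wise counterpart of `DT[NA]`.

```r
library(data.table)

DF <- data.frame(ColA = c(1, NA, 3), ColB = c(1, 2, NA))
DT <- as.data.table(DF)

# Logical i: data.table treats NA as FALSE, data.frame returns an NA row.
dt_lgl <- DT[c(TRUE, NA, FALSE)]
df_lgl <- DF[c(TRUE, NA, FALSE), ]

# Comparison with missing values: no is.na() guards needed in data.table.
dt_eq       <- DT[ColA == ColB]
df_eq_naive <- DF[DF$ColA == DF$ColB, ]
df_eq_guard <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]

# A bare NA: one NA row in data.table, one NA row per input row in data.frame.
dt_na <- DT[NA]
df_na <- DF[NA, ]

c(nrow(dt_lgl), nrow(df_lgl), nrow(dt_eq), nrow(df_eq_naive), nrow(df_eq_guard))
```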

data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.

check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.

stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor. [Since R 4.0.0, the data.frame default is also FALSE.]
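The constructor differences can be seen in a short sketch. The column name `my col` and the values are arbitrary choices for illustration; the data.frame default for stringsAsFactors depends on the R version, so only the data.table side is shown for that point.

```r
library(data.table)

# check.names: data.frame repairs non-syntactic names, data.table keeps them.
df_names <- names(data.frame(`my col` = 1:2))
dt_names <- names(data.table(`my col` = 1:2))

# Strings stay character in a data.table.
dt_chr <- data.table(x = c("a", "b"))
is.character(dt_chr$x)

# A list argument: three columns in data.frame, one list column in data.table.
df_list <- data.frame(list(1:2, "k", 1:4))
dt_list <- data.table(list(1:2, "k", 1:4))

dim(df_list)
dim(dt_list)
```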

Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
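The drop edge case mentioned above typically shows up when columns are selected programmatically. In this sketch the variables `one_col` and `two_cols` are hypothetical stand-ins for a user-supplied selection, and `with = FALSE` is used so that data.table treats them as column names.

```r
library(data.table)

DF <- data.frame(a = 1:3, b = 4:6, c = 7:9)
DT <- as.data.table(DF)

one_col  <- "a"
two_cols <- c("a", "b")

# data.frame: the result type depends on how many columns were selected.
df_two   <- DF[, two_cols]               # data.frame
df_one   <- DF[, one_col]                # silently becomes a vector
df_fixed <- DF[, one_col, drop = FALSE]  # data.frame again

# data.table: always a data.table, whatever the length of the selection.
dt_two <- DT[, two_cols, with = FALSE]
dt_one <- DT[, one_col, with = FALSE]

c(is.data.frame(df_two), is.data.frame(df_one), is.data.frame(df_fixed))
c(is.data.table(dt_two), is.data.table(dt_one))
```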

Small caveat

There may be cases where a package uses code that falls down when given a data.table. However, because data.table is actively maintained to avoid such problems, any that do arise tend to be fixed promptly.

For example:

see this question and prompt response

From the NEWS for v 1.8.2:

base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.

An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.


answered Oct 11 '22 12:10

5 revs, 4 users 67%

Tags: dataframe, r, data.table

AdamNYC
