Is it safe to use "df" as the name for a dataframe?

Tags:

r

A common idiom (found in books, tutorials, and on many Stack Overflow questions) is to use df as a sort of throw-away identifier for a dataframe. I've done so hundreds of times with seemingly no ill-effect, but then ran into the following code:

library(tree)
df <- droplevels(iris[1:100,c(1,2,5)])
tr <- tree(Species ~ ., data = df)
plot(tr)
text(tr)
partition.tree(tr)

This gives the following error message:

Error in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class ""function"" to a data.frame

I discovered by trial and error that if I simply replace df above by df2, the code works as expected. It is true that df is the name of the density function for the F-distribution, but that doesn't seem to be remotely relevant here. Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?


asked May 03 '18 20:05

John Coleman

2 Answers

Because the potential name conflict would make errors more difficult to debug, I forced myself to use dtf instead of df for a long time. However, important packages in the tidyverse seem to be OK with using df everywhere in their tests, for example in test-select.r:

  df <- tibble(g = 1:3, x = 3:1) %>% group_by(g)

I've been using df a lot recently to name Python pandas data frames, so I tend to use df in R as well nowadays. Let's see if this bites back.

Flat or nested namespace

The question of namespaces is not part of the original question, but it is related to this issue of a name conflict with df. A flat namespace is easier and more fun to use in exploratory data analysis because you just call all functions directly, but it can lead to collisions. A nested namespace makes debugging more reliable at the cost of being a little more cumbersome, because you have to prefix each function call with the package name.

Namespace collisions are less of an issue in Python because it has a more nested namespace. For example, you import numpy as np and prefix all numpy function calls with np, such as np.array(). (It's possible to do from numpy import *, but it is frowned upon and linters typically complain about it.)

In R you have to distinguish throwaway code used in exploratory data analysis from more durable code that you are going to reuse. In the second case, if you use only one or a few functions from another package, it's better not to attach the whole package with library(package_name) but to call the functions you really need with package_name::function.
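As a minimal sketch of the package_name::function style, here is a data frame named df used alongside the F-distribution density, which is also named df. The column names and the parameter values (df1 = 2, df2 = 10) are arbitrary choices for illustration, not anything taken from the question.

```r
# A throwaway data frame that happens to be called df.
df <- data.frame(x = c(0.5, 1, 2))

# The explicit stats:: prefix makes it unambiguous that the
# F-distribution density is meant, not the data frame above.
df$density <- stats::df(df$x, df1 = 2, df2 = 10)

# The same style works for any exported function, here utils::head().
first_rows <- utils::head(df, 2)
```

With the prefix, a reader (and R) never has to guess which df is intended, at the cost of a few extra characters per call.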


answered Sep 19 '22 14:09

Paul Rougieux

Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?

I think in this case it may be both, but for your purposes I would take it more as a cautionary example. The fact that it causes an error here indicates that it may not be the best practice.

In my experience R does not manage namespaces very well (compared with Python, for example). Because of this, it may have been unwise for the authors of tree to introduce (intentionally or not) a conflict with df, which is a common throwaway name for a dataframe, if in fact they did so. See the comments here and on the question: it is unclear whether this is a clash in data.frame names or improper use of eval() causing clashes between data.frame objects and functions.
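To illustrate how such a clash can arise in general, here is a rough sketch showing that the same symbol df can resolve to a data frame or to a function depending on the environment it is evaluated in. This is not a claim about what tree does internally; lookup_demo is a hypothetical helper made up for this example.

```r
# Hypothetical helper: evaluates the symbol df in two different environments.
lookup_demo <- function() {
  df <- data.frame(x = 1:3)  # local data frame named df
  list(
    # Evaluated in this function's own environment: finds the data frame.
    local = eval(quote(df), envir = environment()),
    # Evaluated in the stats namespace: finds the F-distribution density.
    elsewhere = eval(quote(df), envir = asNamespace("stats"))
  )
}

out <- lookup_demo()
is.data.frame(out$local)    # TRUE
is.function(out$elsewhere)  # TRUE

# Coercing the function to a data frame fails with a "cannot coerce" error,
# the same kind of message as in the question.
msg <- tryCatch(as.data.frame(out$elsewhere),
                error = function(e) conditionMessage(e))
```

So code that evaluates a user-supplied name in the wrong environment can silently pick up stats::df instead of the user's data frame, and the failure only shows up later when something tries to treat it as data.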

With that said, it is a good example of why namespaces are important and (IMO) suggestive of how to write better R code. I think namespaces are being introduced to the R ecosystem, but my experience with R is that there is a lot of namespace 'flatness' and lots of opportunities for name conflicts. For this reason I would suggest that you take this as a reason to use more descriptive / unique identifiers for your own variables. This avoids conflicts like the one you encountered, and provides some future-proofing to help avoid conflicts creeping into previously working code if package internals change.
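If you want to check a candidate name before using it, something like the following sketch may help. It only tests whether a function of that name is visible from the global environment with the currently attached packages; the helper clashes_with_function and the name iris_subset are assumptions made up for this example.

```r
# Hypothetical helper: is there a visible function with this name?
clashes_with_function <- function(nm) {
  exists(nm, envir = globalenv(), mode = "function")
}

candidates <- c("df", "iris_subset")
clashes <- vapply(candidates, clashes_with_function, logical(1))

clashes[["df"]]           # TRUE, because stats::df is on the search path
clashes[["iris_subset"]]  # FALSE unless you have defined such a function
```

A more descriptive name such as iris_subset is much less likely to collide with anything a package exports, now or in a future version.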

answered Sep 19 '22 14:09

bjarchi
