A common idiom (found in books, tutorials, and on many Stack Overflow questions) is to use df as a sort of throw-away identifier for a dataframe. I've done so hundreds of times with seemingly no ill-effect, but then ran into the following code:
library(tree)
df <- droplevels(iris[1:100,c(1,2,5)])
tr <- tree(Species ~ ., data = df)
plot(tr)
text(tr)
partition.tree(tr)
This gives the following error message:
Error in as.data.frame.default(data, optional = TRUE) :
  cannot coerce class "function" to a data.frame
I discovered by trial and error that if I simply replace df above by df2, the code works as expected. It is true that df is the name of the density function for the F-distribution, but that doesn't seem to be remotely relevant here. Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?
Because the potential name conflict would make errors more difficult to debug, I forced myself to use dtf instead of df for a long time. However, an important collection of packages, the tidyverse, seems to be fine with using df everywhere in its tests, for example in test-select.r:
df <- tibble(g = 1:3, x = 3:1) %>% group_by(g)
I've been using df a lot recently to name Python pandas data frames, so I tend to use df in R as well nowadays. Let's see if this bites back.
The question of namespaces is not part of the original question, but it is related to this issue of a name conflict with df. A flat namespace is easier and more fun to use in exploratory data analysis because you just call all functions directly, but it can lead to collisions. A nested namespace makes debugging more reliable at the cost of being a little more cumbersome, because you have to prefix each function call with the package name.
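As a rough sketch of what such a collision does and does not do in R's flat search path (the variable names below are my own illustration, not taken from the question): a data frame called df does not stop you from calling the F-density function, because R skips non-function objects when a name is used in call position. It is only when the name is used as a value that the first object found wins.

```r
# Illustration only: a data frame named df coexists with stats::df.
df <- data.frame(x = 1:3)

# In call position R skips non-function objects, so this still reaches
# the F-distribution density in the stats package.
density_via_search <- df(1, df1 = 2, df2 = 3)
density_explicit   <- stats::df(1, df1 = 2, df2 = 3)
identical(density_via_search, density_explicit)  # TRUE

# Used as a value, the name resolves to the first object found on the
# lookup path, which here is the data frame defined above.
is.data.frame(df)  # TRUE
```

So in ordinary interactive use the clash is usually harmless, which is probably why it goes unnoticed for so long.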
Namespace collisions are less of an issue in Python because it has a more nested namespace. For example, you import numpy as np and prefix all numpy function calls with np, such as np.array(). (It's possible to do from numpy import *, but it is frowned upon and linters typically complain about it.)
In R you have to distinguish throwaway code used in exploratory data analysis from more durable code that you are going to reuse. In the second case, if you use only one or a few functions from another package, it's better not to attach the package with library(package_name) but to call the functions you really need with package_name::function.
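A minimal sketch of that style, using only the stats package that ships with R. The helper name summarise_spread is hypothetical and only there for illustration. Because every call is qualified with ::, the helper keeps working even if something in the workspace masks sd.

```r
# Hypothetical helper: every external function is called with an explicit
# package prefix, so nothing here depends on what is attached or masked.
summarise_spread <- function(x) {
  c(sd = stats::sd(x), iqr = stats::IQR(x))
}

# Simulate a collision: a user-defined sd() now masks stats::sd()
# for unqualified calls made from the workspace.
sd <- function(x) stop("this is not the sd() you wanted")

# The helper is unaffected because it asks for stats::sd explicitly.
spread <- summarise_spread(c(1, 2, 3, 4))

# Clean up the masking definition again.
rm(sd)
```

The cost is a bit more typing, which is why I would only bother for code that is meant to last.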
Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?
I think in this case it may be both, but for your purposes I would take it more as a cautionary example. The fact that it causes an error here indicates that it may not be the best practice.
In my experience R does not manage namespaces very well (compared to Python, for example). Because of this, it may have been unwise for the authors of tree to introduce (intentionally or not) a conflict with df, which is a common throwaway name for a dataframe, if in fact they did so (see comments here and in the question; it is unclear whether this is a clash in data.frame names or improper use of eval() causing clashes between data.frame objects and functions).
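To make the eval() possibility concrete, here is a sketch of how evaluating a name in the wrong environment can turn a data frame into a function. This is not the tree package's actual code. The two functions are hypothetical stand-ins, and setting their enclosing environment to the stats namespace is my assumption to mimic a package function that sees stats::df before the user's workspace.

```r
# Hypothetical stand-in for a package function. Its enclosing environment
# is the stats namespace, so free names are searched there before the
# user's workspace.
lookup_from_namespace <- function() {
  obj <- eval(quote(df))  # evaluated in this function's own frame
  class(obj)
}
environment(lookup_from_namespace) <- asNamespace("stats")

# Same setup, but the name is evaluated in the caller's frame instead.
lookup_from_caller <- function() {
  obj <- eval(quote(df), envir = parent.frame())
  class(obj)
}
environment(lookup_from_caller) <- asNamespace("stats")

df <- data.frame(x = 1:3)

found_in_namespace <- lookup_from_namespace()  # "function"
found_in_caller    <- lookup_from_caller()     # "data.frame"
```

The first variant finds the F-density function and would fail with a "cannot coerce" style error as soon as it tried to treat the result as data. Renaming the data frame to df2 sidesteps it only because stats has no object called df2.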
With that said, it is a good example of why namespaces are important and (IMO) suggestive of how to write better R code. I think namespaces are being introduced to the R ecosystem, but my experience with R is that there is a lot of namespace 'flatness' and lots of opportunities for name conflicts. For this reason I would suggest that you take this as a reason to use more descriptive / unique identifiers for your own variables. This avoids conflicts like the one you encountered, and provides some future-proofing to help avoid conflicts creeping into previously working code if package internals change.