Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr masks GGally and breaks ggparcoord

Given a fresh session, executing a small ggparcoord(.) example provided in the documentation of the function

library(GGally)

data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = diamonds.samp, columns = c(1, 5:10))

results into the following plot:

enter image description here

Again, starting in a fresh session and executing the same script with the loaded dplyr

library(GGally)
library(dplyr)

data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = diamonds.samp, columns = c(1, 5:10))

results in:

Error: (list) object cannot be coerced to type 'double'

Note that the order of the library(.) statements does not matter.

Questions

  1. Is there something wrong with the code samples?
  2. Is there a way to overcome the problem (over some namespace functions)?
  3. Or is this a bug?

I need both dplyr and ggparcoord(.) in a bigger analysis but this minimal example reflects the problem i am facing.

Versions

  • R @ 3.2.3
  • dplyr @ 0.4.3
  • GGally @ 1.0.1
  • ggplot @ 2.0.0

UPDATE

To wrap the excellent answer given by Joran up:

Answers

  1. The code samples are in fact wrong as ggparcoord(.) expects a data.frame not a tbl_df as given by the diamonds data set (if dplyr is loaded).
  2. The problem is solved by coercing the tbl_df to a data.frame.
  3. No it is not a bug.

Working code sample:

library(GGally)
library(dplyr)

data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = as.data.frame(diamonds.samp), columns = c(1, 5:10))
like image 573
Hannes Avatar asked Feb 10 '16 22:02

Hannes


2 Answers

Converting my comments to an answer...

The GGally package here is making the reasonable assumption that using [ on a data frame should behave the way it always does and always has. However, this all being in the Hadley-verse, the diamonds data set is a tbl_df as well as a data.frame.

When dplyr is loaded, the behavior of [ is overridden such that drop = FALSE is always the default for a tbl_df. So there's a place in GGally where data[,"cut"] is expected to return a vector, but instead it returns another data frame.

...specifically, the error is thrown in your example while attempting to execute:

data[, fact.var] <- as.numeric(data[, fact.var]). 

Since data[,fact.var] remains a data frame, and hence a list, as.numeric won't work.

As for your conclusion that this isn't a bug, I'd say....maybe. Probably. At least there probably isn't anything the GGally package author ought to do to address it. You just have to be aware that using tbl_df's with non-Hadley written packages may break things.

As you noted, removing the extra class attributes fixes the problem, as it returns R to using the normal [ method.

like image 42
joran Avatar answered Sep 23 '22 07:09

joran


Workaround: coerce your data for ggparcoord to as.data.table(...) or as.data.table(... , keep.rownames=TRUE) unless you want to lose all your rownames.

Cause: as per @joran's investigating, when dplyr is loaded, tbl_df overrides [ so that drop = FALSE.

Solution: file a pull-request on GGally. edit: fixed in v1.3.0 (https://github.com/ggobi/ggally/commit/bfa930d102289d723de2ce9ec528baf42b3b7b40)

like image 99
smci Avatar answered Sep 23 '22 07:09

smci