I am using ggplot2 2.1.0 to plot histograms, and I am seeing unexpected behaviour in how values are assigned to histogram bins. Here is an example with left-closed bins (i.e. [0, 0.1[) and a binwidth of 0.1.
library("ggplot2")
mydf <- data.frame(myvar = c(-1, -0.5, -0.4, -0.1, -0.1, 0.05, 0.1, 0.1, 0.25, 0.5, 1))
myplot <- ggplot(mydf, aes(myvar)) + geom_histogram(aes(y=..count..),binwidth = 0.1, boundary=0.1,closed="left")
myplot
ggplot_build(myplot)$data[[1]]
In this example, one would expect the value -0.4 to be in the bin [-0.4, -0.3[, but it falls instead (mysteriously) in the bin [-0.5, -0.4[. The same goes for the value -0.1, which falls in [-0.2, -0.1[ instead of [-0.1, 0[, and so on.
Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?
Thanks in advance, Best regards, Arnaud
PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651
Edit: The problem described below was fixed in a recent release of ggplot2.
Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.
PROBLEM:
df <- data.frame(var = seq(-100,100,10)/100)
as.list(df) # check the data
## $var
##  [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
## [10] -0.1  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7
## [19]  0.8  0.9  1.0
library("ggplot2")
p <- ggplot(data = df, aes(x = var)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.1,
                 boundary = 0.1,
                 closed = "left")
p
SOLUTION
Tweak the boundary parameter. In this example, setting it just below 1, say 0.99, works. Your use case should be amenable to similar tweaking.
ggplot(data = df, aes(x = var)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.05,
                 boundary = 0.99,
                 closed = "left")
(I have made the binwidth narrower so the plot is easier to read.)
Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus a small multiple of the machine epsilon (see eps below). In ggplot2 the fuzz factor is 1e-7 (earlier versions) or 1e-8 (later versions).
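The kind of rounding mismatch involved can be seen in base R, without ggplot2. This is only a sketch: it assumes the breaks are obtained by multiplying the binwidth, which may not be exactly how ggplot2 builds them, and the variable names are made up for the illustration.

```r
x      <- seq(-100, 100, 10) / 100   # data: -1.0, -0.9, ..., 1.0
breaks <- -1 + (0:21) * 0.1          # breaks built from the binwidth (assumption)

# The two vectors agree within tolerance ...
isTRUE(all.equal(x, breaks[seq_along(x)]))

# ... but may differ in the last bits; list the exact mismatches, if any
mismatch <- x != breaks[seq_along(x)]
data.frame(x = x, brk = breaks[seq_along(x)],
           diff = x - breaks[seq_along(x)])[mismatch, ]

# Left-closed binning: findInterval() returns i with breaks[i] <= x < breaks[i + 1],
# so a value that is a hair below "its" break drops into the bin to the left
bin_idx <- findInterval(x, breaks)
x[bin_idx != seq_along(x)]           # values that landed one bin too low
```

Any value listed by the last line sits marginally below the break it is supposed to equal, which is the situation the boundary tweak and the eps multiplication are trying to sidestep.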
CAUSE:
The problem appears clearly in ncount:
str(ggplot_build(p)$data[[1]])
## 'data.frame': 20 obs. of 17 variables:
## $ y : num 1 1 1 1 1 2 1 1 1 0 ...
## $ count : num 1 1 1 1 1 2 1 1 1 0 ...
## $ x : num -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 ...
## $ xmin : num -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 ...
## $ xmax : num -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 ...
## $ density : num 0.476 0.476 0.476 0.476 0.476 ...
## $ ncount : num 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0 ...
## $ ndensity: num 1.05 1.05 1.05 1.05 1.05 2.1 1.05 1.05 1.05 0 ...
## $ PANEL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 1 1 1 1 1 2 1 1 1 0 ...
## $ colour : logi NA NA NA NA NA NA ...
## $ fill : chr "grey35" "grey35" "grey35" "grey35" ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype: num 1 1 1 1 1 1 1 1 1 1 ...
## $ alpha : logi NA NA NA NA NA NA ...
ggplot_build(p)$data[[1]]$ncount
## [1] 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.0 1.0 0.5
## [13] 0.5 0.5 0.0 1.0 0.5 0.0 1.0 0.5
ROUNDING ERRORS?
It looks like it:
df <- data.frame(var = as.integer(seq(-100, 100, 10)))
# eps <- 1.000000000000001 # on my system
eps <- 1 + 10 * .Machine$double.eps
p <- ggplot(data = df, aes(x = eps * var / 100)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.05,
                 closed = "left")
p
(I have removed the boundary option altogether)
This behaviour appears some time after ggplot2_1.0.1. Looking at the source code, e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R, and tracing the computations of count leads to the function bin_vector(), which contains the following lines:
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
  # ... code omitted for clarity ...
  cut(x, bins$breaks, right = bins$right_closed,
      include.lowest = TRUE)
  # ... code omitted for clarity ...
}
By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...
SUMMING UP DEBUGGING
By "patching" the bin_vector function and printing the output to screen, it appears that:
bins$fuzzy correctly stores the fuzzy parameters.
The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.
If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. Not a proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.
At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.
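The idea behind fuzzy breaks can be sketched in base R. This is an illustration only, not ggplot2's actual code: the helper fuzzy_breaks() and the fuzz size (1e-8 times the median bin width) are assumptions chosen for the example. For left-closed bins each break is moved down slightly, so a value that lands a hair below its nominal break is still counted in the bin that starts there.

```r
# Hypothetical helper: shift breaks by a tiny amount relative to the bin width
fuzzy_breaks <- function(breaks, closed = c("right", "left"), rel_fuzz = 1e-8) {
  closed <- match.arg(closed)
  fuzz <- rel_fuzz * stats::median(diff(breaks))
  # left-closed bins [a, b): lower the breaks; right-closed (a, b]: raise them
  if (closed == "left") breaks - fuzz else breaks + fuzz
}

x      <- seq(-100, 100, 10) / 100   # data: -1.0, -0.9, ..., 1.0
breaks <- -1 + (0:21) * 0.1          # nominal breaks (assumed construction)

plain <- findInterval(x, breaks)                                  # may be off by one
fuzzy <- findInterval(x, fuzzy_breaks(breaks, closed = "left"))   # tolerant binning

# With the fuzzy breaks, the i-th value lands in the i-th bin
all(fuzzy == seq_along(x))
```

Comparing plain and fuzzy shows which values, if any, change bins once the tolerance is applied.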
PATCHING
To "patch" the bin_vector function, copy the function definition from the GitHub source or, more conveniently, print it at the R console with:
ggplot2:::bin_vector
Modify it (patch it) and assign it into the namespace:
library("ggplot2")
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
  # ... code omitted for clarity ...
  ## MY PATCH: replace bins$breaks with bins$fuzzy
  bin_idx <- cut(x, bins$fuzzy, right = bins$right_closed,
                 include.lowest = TRUE)
  # ... code omitted for clarity ...
  ggplot2:::bin_out(bin_count, bin_x, bin_widths)
}
## Assign the patched function into the ggplot2 namespace
assignInNamespace("bin_vector", bin_vector, ns = "ggplot2")
df <- data.frame(var = seq(-100, 100, 10) / 100)
ggplot(data = df, aes(x = var)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, boundary = 1, closed = "left")
Just to be clear, the code above is abridged: the real function has a lot of type-checking and other calculations that I have removed here, but which you need to keep when you patch the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.
OLD VERSIONS
The unexpected behaviour is NOT observed in versions ggplot2_0.9.3 or ggplot2_1.0.1 and appears to originate in the current release ggplot2_2.0.1 (or perhaps the earlier ggplot2_2.0.0, which gave me an error when I tried to call it).
To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:
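URL <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz"
# The target library directory must exist before installing into it
dir.create("~/R/testing/ggplot2093", recursive = TRUE, showWarnings = FALSE)
install.packages(URL, repos = NULL, type = "source",
                 lib = "~/R/testing/ggplot2093")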
To load the old version, call it from your local directory:
library("ggplot2", lib.loc = "~/R/testing/ggplot2093")
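When several copies of a package are installed side by side, it is easy to lose track of which one is being picked up. A small check along these lines may help; pkg_info() is a hypothetical helper written for this example, and the directory is the one assumed above.

```r
# Hypothetical helper: report version and location of a package in a given library
pkg_info <- function(pkg, lib.loc = NULL) {
  list(
    version = as.character(utils::packageVersion(pkg, lib.loc = lib.loc)),
    path    = find.package(pkg, lib.loc = lib.loc)
  )
}

old_lib <- "~/R/testing/ggplot2093"
if (dir.exists(path.expand(old_lib))) {
  # Only meaningful once the old version has been installed into old_lib
  print(pkg_info("ggplot2", lib.loc = old_lib))
}
```

After library() has been called, packageVersion("ggplot2") with no lib.loc reports the version of the copy that is actually loaded in the session.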