I'd like to create a ggplot2 histogram in which the plot's limits are equal to the smallest and largest values in the data set, without excluding those values from the actual histogram.
I get the behavior I'm looking for when using base graphics. Specifically, the second histogram below shows all of the same values as the first histogram (i.e., no bins are excluded in the second histogram), even though I've included an xlim argument in the second plot:
min_wt <- min(mtcars$wt)
max_wt <- max(mtcars$wt)
xlim <- c(min_wt, max_wt)
hist(mtcars$wt, breaks = 30, main = "No limits added")
hist(mtcars$wt, breaks = 30, xlim = xlim, main = "Limits added")
ggplot2 isn't giving me this behavior though:
library(ggplot2)
# Using green colour to make dropped bins easy to see:
p <- ggplot(mtcars, aes(x = wt)) + geom_histogram(colour = "green", bins = 30)
p + ggtitle("No limits added")
p + xlim(xlim) + ggtitle("Limits added")
Notice that in the second plot I lose one of the points below 2 and two of the points above 5. I would like to know how to fix this. A few miscellaneous notes:
First, specifying boundary allows me to include the minimum values (i.e., those below 2) in the histogram, but I still don't have a solution for the two values greater than 5 that are getting dropped:
ggplot(mtcars, aes(x = wt)) +
  geom_histogram(bins = 30, colour = "green", boundary = min_wt) +
  xlim(xlim) +
  ggtitle("Limits added with boundary too")
Second, whether the issue appears depends on the value chosen for bins. For example, when I increase bins to 50, no values are dropped:
ggplot(mtcars, aes(x = wt)) +
  geom_histogram(bins = 50, colour = "green", boundary = min_wt) +
  xlim(xlim) +
  ggtitle("Limits added with boundary too, but with bins = 50")
Finally, I believe this issue is related to the one presented on SO here: geom_histogram: wrong bins? and discussed here as well: https://github.com/tidyverse/ggplot2/issues/1651. In other words, I think this issue is related to a "rounding error." I describe this error in more depth in my second post (the one with the graphs shown in it) on this issue: https://github.com/daattali/ggExtra/issues/81.
Here is my session info:
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] labeling_0.3 colorspace_1.3-2 scales_0.5.0.9000
[4] compiler_3.4.2 lazyeval_0.2.1 plyr_1.8.4
[7] tools_3.4.2 pillar_1.2.1 gtable_0.2.0
[10] tibble_1.4.2 yaml_2.1.16 Rcpp_0.12.15
[13] grid_3.4.2 rlang_0.2.0.9000 munsell_0.4.3
To change the number of bins in a ggplot2 histogram, use the bins argument of geom_histogram(). It sets the number of bars (bins) that the histogram is divided into.
ggplot2 is "a plotting system for R, based on the grammar of graphics", created by Hadley Wickham.
To construct a histogram, the data is split into intervals called bins. The intervals may or may not be of equal size. For each bin, the number of data points that fall into it is counted (the frequency). The Y axis of the histogram represents the frequency and the X axis represents the variable.
binwidth: the width of the bins. Can be specified as a numeric value, or as a function that calculates the width from x. If binwidth is not given, the default is to use the number of bins set by bins, covering the range of the data. You should always override this value, exploring multiple widths to find the one that best illustrates the stories in your data.
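The difference between bins and binwidth can be checked on the data from the question. This is a minimal sketch, not part of the original post: the values 10 and 0.5 are arbitrary choices for illustration, and layer_data() is used only to read back the bins that ggplot2 computed.

```r
library(ggplot2)

# Fix the number of bins (10 is an arbitrary illustrative value)
p_bins <- ggplot(mtcars, aes(x = wt)) + geom_histogram(bins = 10)

# Fix the bin width instead, in the units of wt (0.5 is also arbitrary)
p_width <- ggplot(mtcars, aes(x = wt)) + geom_histogram(binwidth = 0.5)

# layer_data() returns one row per computed bin, with its edges in xmin/xmax
bins_data  <- layer_data(p_bins)
width_data <- layer_data(p_width)

n_bins <- nrow(bins_data)                     # number of bins actually drawn
widths <- width_data$xmax - width_data$xmin   # width of each bin
print(n_bins)
print(range(widths))
```

Setting binwidth keeps the bin size meaningful in data units, whereas bins makes the width depend on the range being binned, which here includes any limits set on the x scale.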
Another option, in addition to what @eipi10 mentioned in the comments, is to change the oob (out of bounds) argument of scale_x_continuous.
Function that handles limits outside of the scale limits (out of bounds). The default replaces out of bounds values with NA.
The default is scales::censor(); you can change it to oob = scales::squish, which squishes out-of-bounds values into the range of the limits.
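To see what the two functions do on their own, outside of a plot, they can be called directly on a numeric vector. This is a small sketch: the vector and the limits are made-up values chosen for illustration, not taken from mtcars.

```r
library(scales)

values <- c(1.5, 3.0, 5.4)  # illustrative values only
limits <- c(2, 5)           # illustrative limits only

# censor() replaces anything outside the limits with NA
censored <- censor(values, range = limits)
print(censored)  # NA 3 NA

# squish() moves anything outside the limits onto the nearest limit
squished <- squish(values, range = limits)
print(squished)  # 2 3 5
```

In the histogram, these functions are applied to the bar edges, so with squish a bar that sticks out past a limit is drawn narrower instead of being removed.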
Compare the following two plots.
p + scale_x_continuous(limits = xlim) + ggtitle("default: scales::censor")
warning: Removed 1 rows containing missing values (geom_bar).
p + scale_x_continuous(limits = xlim, oob = scales::squish) + ggtitle("using scales::squish")
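The difference can also be checked numerically rather than by eye. The following is a rough sketch under a few assumptions: count_dropped_bars() is a hypothetical helper written for this illustration, xlim_wt is just a renamed copy of the question's xlim, and the approach relies on censored bar edges showing up as NA in the built layer data, which may vary between ggplot2 versions.

```r
library(ggplot2)

# Same data and limits as in the question (renamed to avoid masking xlim())
xlim_wt <- range(mtcars$wt)
p <- ggplot(mtcars, aes(x = wt)) + geom_histogram(bins = 30)

# Hypothetical helper: counts bars with an edge that the x scale turned into NA
count_dropped_bars <- function(plot) {
  bars <- layer_data(plot)
  sum(is.na(bars$xmin) | is.na(bars$xmax))
}

n_censor <- count_dropped_bars(
  p + scale_x_continuous(limits = xlim_wt)
)
n_squish <- count_dropped_bars(
  p + scale_x_continuous(limits = xlim_wt, oob = scales::squish)
)

print(c(censor = n_censor, squish = n_squish))
```

With squish, no bar edge should end up as NA, because out-of-bounds edges are moved onto the limits rather than discarded.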
Your third ggplot, where you specified a boundary but the two values greater than 5 were still dropped, would then look like this:
ggplot(mtcars, aes(x = wt)) +
  geom_histogram(bins = 30, colour = "green", boundary = min_wt) +
  scale_x_continuous(limits = xlim, oob = scales::squish) +
  ggtitle("Limits added with boundary too") +
  labs(subtitle = "scales::squish")
Hope this helps.