Values getting dropped from ggplot2 histogram when specifying limits

Tags: r, ggplot2

I'd like to create a ggplot2 histogram in which the plot's limits are equal to the smallest and largest values in the data set, without excluding those values from the actual histogram.

I get the behavior I'm looking for when using base graphics. Specifically, the second histogram below shows all of the same values as the first histogram (i.e., no bins are excluded in the second histogram), even though I've included an xlim argument to the second plot:

min_wt <- min(mtcars$wt)
max_wt <- max(mtcars$wt)
xlim <- c(min_wt, max_wt)

hist(mtcars$wt, breaks = 30, main = "No limits added")

hist(mtcars$wt, breaks = 30, xlim = xlim, main = "Limits added")


ggplot2 isn't giving me this behavior though:

library(ggplot2)

# Using green colour to make dropped bins easy to see:
p <- ggplot(mtcars, aes(x = wt)) + geom_histogram(colour = "green", bins = 30)
p + ggtitle("No limits added")

p + xlim(xlim) + ggtitle("Limits added") 


Notice that in the second plot I lose one of the points below 2 and two of the points above 5. I would like to know how to fix this. A few miscellaneous notes:

First, specifying boundary lets me include the minimum values (i.e., those below 2) in the histogram, but I still don't have a solution for the 2 values greater than 5 that are getting dropped:

ggplot(mtcars, aes(x = wt)) + 
  geom_histogram(bins = 30, colour = "green", boundary = min_wt) + 
  xlim(xlim) +
  ggtitle("Limits added with boundary too")


Second, whether the issue appears depends on the value chosen for bins. For example, when I increase bins to 50, I don't get any dropped values:

ggplot(mtcars, aes(x = wt)) + 
  geom_histogram(bins = 50, colour = "green", boundary = min_wt) + 
  xlim(xlim) +
  ggtitle("Limits added with boundary too, but with bins = 50")


Finally, I believe this issue is related to the one presented on SO here: geom_histogram: wrong bins? and discussed here as well: https://github.com/tidyverse/ggplot2/issues/1651. In other words, I think this issue is related to a "rounding error." I describe this error in more depth in my second post (the one with the graphs shown in it) on this issue: https://github.com/daattali/ggExtra/issues/81.

Here is my session info:

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] ggplot2_2.2.1

loaded via a namespace (and not attached):
 [1] labeling_0.3      colorspace_1.3-2  scales_0.5.0.9000
 [4] compiler_3.4.2    lazyeval_0.2.1    plyr_1.8.4       
 [7] tools_3.4.2       pillar_1.2.1      gtable_0.2.0     
[10] tibble_1.4.2      yaml_2.1.16       Rcpp_0.12.15     
[13] grid_3.4.2        rlang_0.2.0.9000  munsell_0.4.3 

Asked by Chris, Mar 10 '18 at 01:03




1 Answer

As an alternative to what @eipi10 mentioned in the comments, you can change the oob (out of bounds) argument of scale_x_continuous. The documentation describes it as follows:

Function that handles limits outside of the scale limits (out of bounds). The default replaces out of bounds values with NA.

The default is scales::censor(). You can change it to oob = scales::squish, which squishes out-of-bounds values into the range instead of replacing them with NA.
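
To see the difference between the two functions outside of a plot, here is a minimal sketch that applies each of them to a small made-up vector. The values and the range are purely illustrative and are not taken from mtcars.

```r
library(scales)

# Illustrative values: 1 and 6 fall outside the range c(2, 5).
x <- c(1, 3, 6)
rng <- c(2, 5)

# censor() replaces out-of-bounds values with NA.
censored <- censor(x, range = rng)

# squish() moves out-of-bounds values to the nearest limit.
squished <- squish(x, range = rng)

print(censored)  # NA  3 NA
print(squished)  # 2 3 5
```

As far as I understand, the same thing happens to the bar edges of the histogram: a bar whose edge lies outside the limits ends up with a missing edge under censor and is removed, whereas squish moves that edge onto the limit so the bar is kept.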

Compare the following two plots.

p + scale_x_continuous(limits = xlim) + ggtitle("default: scales::censor")

warning: Removed 1 rows containing missing values (geom_bar).


p + scale_x_continuous(limits = xlim, oob = scales::squish) + ggtitle("using scales::squish")


Your third ggplot, in which you specified a boundary but the 2 values greater than 5 were still dropped, would look like this:

ggplot(mtcars, aes(x = wt)) + 
 geom_histogram(bins = 30, colour = "green", boundary = min_wt) + 
 scale_x_continuous(limits = xlim, oob = scales::squish) +
 ggtitle("Limits added with boundary too") +
 labs(subtitle = "scales::squish")
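
If you would rather check this without eyeballing the plot, one rough approach is to inspect the built plot data. The sketch below assumes that ggplot_build() exposes the bar edges as xmin and xmax columns after the scale's oob function has been applied, which may differ between ggplot2 versions. count_na_edges is a helper name made up for this example.

```r
library(ggplot2)

xlim <- range(mtcars$wt)
p <- ggplot(mtcars, aes(x = wt)) + geom_histogram(bins = 30)

# Hypothetical helper: number of bars with a missing left or right edge
# in the built data of the first layer.
count_na_edges <- function(plot) {
  bars <- ggplot_build(plot)$data[[1]]
  sum(is.na(bars$xmin) | is.na(bars$xmax))
}

n_censor <- count_na_edges(p + scale_x_continuous(limits = xlim))
n_squish <- count_na_edges(
  p + scale_x_continuous(limits = xlim, oob = scales::squish)
)

print(c(censor = n_censor, squish = n_squish))
```

A non-zero count for the default and a zero count for squish would be consistent with the plots above.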


Hope this helps.

Answered by markus, Oct 11 '22 at 14:10