How can I plot a histogram of long-tailed data using R?

Tags:

r

histogram

I have data that is mostly concentrated in a small range (1-10), but a significant number of points (say, 10%) lie in (10-1000). I would like to plot a histogram of this data that focuses on (1-10) but also shows the (10-1000) data. Something like a log scale for the histogram.

Yes, I know this means not all bins are of equal size.

A simple hist(x) gives [image] while hist(x,breaks=c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,3,4,5,7.5,10,15,20,50,100,200,500,1000,10000)) gives [image]

Neither of these is what I want.

Update: following the answers here, I now produce something that is almost exactly what I want (I went with a continuous plot instead of a bar histogram):

breaks <- c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,4,8)
ggplot(t,aes(x)) + geom_histogram(colour="darkblue", size=1, fill="blue") + scale_x_log10('true size/predicted size', breaks = breaks, labels = breaks)

The only problem is that I'd like the scale to match the actual bars plotted. There are two options for doing that: one is to simply use the actual margins of the plotted bars (how?) and then get "ugly" x-axis labels like 1.1754, 1.2985, etc. The other, which I prefer, is to control the actual bin margins used so that they match the breaks.

David B asked Oct 05 '10 09:10



1 Answer

Log-scale histograms are easier with ggplot than with base graphics. Try something like:

library(ggplot2)
dfr <- data.frame(x = rlnorm(100, sdlog = 3))
ggplot(dfr, aes(x)) + geom_histogram() + scale_x_log10()

If you are desperate for base graphics, you need to plot a log-scale histogram without axes, then manually add the axes afterwards.

h <- hist(log10(dfr$x), axes = FALSE) 
Axis(side = 2)
Axis(at = h$breaks, labels = 10^h$breaks, side = 1)
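The same base-graphics workaround can also be given explicit bin edges, so that the bars line up with the labelled breaks. The following is only a sketch under assumptions: the seed and the decade-wide `edges` vector are illustrative choices, not part of the original answer. Any increasing vector of edges that covers the range of the data should work the same way.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rlnorm(100, sdlog = 3)        # long-tailed sample, like dfr$x above

# Hypothetical bin edges on the original scale: one bin per decade,
# wide enough to cover the whole range of x
edges <- 10^seq(floor(log10(min(x))), ceiling(log10(max(x))))

# Bin on the log10 scale, passing the edges transformed the same way
h <- hist(log10(x), breaks = log10(edges), axes = FALSE,
          xlab = "x", main = "Histogram with decade-wide bins")

# Label the axis with the edges back-transformed to the original scale
Axis(side = 2)
Axis(side = 1, at = h$breaks, labels = signif(10^h$breaks, 3))
```

Because the breaks are supplied explicitly, every tick on the x-axis sits exactly on a bin boundary.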

For completeness, the lattice solution would be

library(lattice)
histogram(~x, dfr, scales = list(x = list(log = TRUE)))

AN EXPLANATION OF WHY LOG VALUES ARE NEEDED IN THE BASE CASE:

If you plot the data with no log-transformation, then most of the data are clumped into bars at the left.

hist(dfr$x)

The hist function ignores the log argument (because it interferes with the calculation of breaks), so this doesn't work.

hist(dfr$x, log = "y")

Neither does this.

par(xlog = TRUE)
hist(dfr$x)

That means that we need to log transform the data before we draw the plot.

hist(log10(dfr$x))

Unfortunately, this messes up the axes, which brings us to the workaround above.
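The clumping described in this explanation can also be checked numerically, without drawing anything, by asking hist for the bin counts only. This is a small sketch: the seed and the variable names `raw` and `logd` are assumptions made for illustration.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
x <- rlnorm(100, sdlog = 3)            # long-tailed sample, like dfr$x above

raw  <- hist(x, plot = FALSE)          # bins computed on the original scale
logd <- hist(log10(x), plot = FALSE)   # bins computed on the log10 scale

# Share of all observations that falls into the single fullest bin
max(raw$counts)  / length(x)
max(logd$counts) / length(x)
```

On the original scale the fullest bin should hold nearly all of the observations, while on the log10 scale the observations are spread over several bins.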

Richie Cotton answered Oct 24 '22 11:10