
Logarithmic plot of a cumulative distribution function in matplotlib

I have a file containing logged events. Each entry has a time and latency. I'm interested in plotting the cumulative distribution function of the latencies. I'm most interested in tail latencies so I want the plot to have a logarithmic y-axis. I'm interested in the latencies at the following percentiles: 90th, 99th, 99.9th, 99.99th, and 99.999th. Here is my code so far that generates a regular CDF plot:

import matplotlib.pyplot as plt
import numpy

# retrieve event times and latencies from the file
times, latencies = read_in_data_from_file('myfile.csv')
# compute the CDF
cdfx = numpy.sort(latencies)
# 1.0 / n avoids integer division (which yields 0) under Python 2
cdfy = numpy.linspace(1.0 / len(latencies), 1.0, len(latencies))
# plot the CDF
plt.plot(cdfx, cdfy)
plt.show()

Regular CDF Plot

I know what I want the plot to look like, but I've struggled to get it. I want it to look like this (I did not generate this plot):

Logarithmic CDF Plot

Making the x-axis logarithmic is simple. The y-axis is the one giving me problems. Using set_yscale('log') doesn't work because it wants to use powers of 10. I really want the y-axis to have the same ticklabels as this plot.

How can I get my data into a logarithmic plot like this one?

EDIT:

If I set the yscale to 'log', and ylim to [0.1, 1], I get the following plot:

Log-scale CDF plot with ylim [0.1, 1]

The problem is that a typical log scale plot on a data set ranging from 0 to 1 will focus on values close to zero. Instead, I want to focus on the values close to 1.

asked Jun 30 '15 20:06 by nic




2 Answers

Essentially, you need to apply the following transformation to your Y values: -log10(1-y). The only restriction this imposes is y < 1, so negative values of y can still appear on the transformed plot.
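To make the transformation concrete, here is a minimal sketch in plain Python. The names tail_transform and tail_inverse are made up for illustration and are not part of matplotlib; the point is only that each extra "nine" in y moves the transformed value up by one unit, and that negative y is accepted.

```python
import math


def tail_transform(y):
    """Map y < 1 to -log10(1 - y): 0.9 -> 1, 0.99 -> 2, 0.999 -> 3, ..."""
    return -math.log10(1.0 - y)


def tail_inverse(t):
    """Undo tail_transform: t -> 1 - 10**(-t)."""
    return 1.0 - 10.0 ** (-t)


if __name__ == '__main__':
    # each additional nine lands one unit higher; y = -9 maps to -1
    for y in (-9.0, 0.9, 0.99, 0.999):
        print(y, tail_transform(y))
```

The inverse is what an axis needs in order to turn screen positions back into data values, which is why the custom scale below defines both directions.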

Here's a modified example from matplotlib documentation that shows how to incorporate custom transformations into "scales":

import numpy as np
from numpy import ma
from matplotlib import scale as mscale
from matplotlib import transforms as mtransforms
from matplotlib.ticker import FixedFormatter, FixedLocator


class CloseToOne(mscale.ScaleBase):
    name = 'close_to_one'

    def __init__(self, axis, **kwargs):
        # ScaleBase.__init__ expects the axis argument in matplotlib >= 3.1
        super().__init__(axis)
        self.nines = kwargs.get('nines', 5)

    def get_transform(self):
        return self.Transform(self.nines)

    def set_default_locators_and_formatters(self, axis):
        axis.set_major_locator(FixedLocator(
                np.array([1-10**(-k) for k in range(1+self.nines)])))
        axis.set_major_formatter(FixedFormatter(
                [str(1-10**(-k)) for k in range(1+self.nines)]))


    def limit_range_for_scale(self, vmin, vmax, minpos):
        return vmin, min(1 - 10**(-self.nines), vmax)

    class Transform(mtransforms.Transform):
        input_dims = 1
        output_dims = 1
        is_separable = True

        def __init__(self, nines):
            mtransforms.Transform.__init__(self)
            self.nines = nines

        def transform_non_affine(self, a):
            masked = ma.masked_where(a > 1-10**(-1-self.nines), a)
            if masked.mask.any():
                # use the masked array so out-of-range values stay masked
                return -ma.log10(1-masked)
            else:
                return -np.log10(1-a)

        def inverted(self):
            return CloseToOne.InvertedTransform(self.nines)

    class InvertedTransform(mtransforms.Transform):
        input_dims = 1
        output_dims = 1
        is_separable = True

        def __init__(self, nines):
            mtransforms.Transform.__init__(self)
            self.nines = nines

        def transform_non_affine(self, a):
            return 1. - 10**(-a)

        def inverted(self):
            return CloseToOne.Transform(self.nines)

mscale.register_scale(CloseToOne)

if __name__ == '__main__':
    import pylab
    pylab.figure(figsize=(20, 9))
    t = np.arange(-0.5, 1, 0.00001)
    pylab.subplot(121)
    pylab.plot(t)
    pylab.subplot(122)
    pylab.plot(t)
    pylab.yscale('close_to_one')

    pylab.grid(True)
    pylab.show()

normal and transformed plot

Note that you can control the number of 9's via a keyword argument:

pylab.figure()
pylab.plot(t)
pylab.yscale('close_to_one', nines=3)
pylab.grid(True)

plot with 3 nine's
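As a possibly shorter alternative on newer matplotlib versions (3.1 and later, as far as I know), the built-in 'function' scale accepts a forward/inverse pair directly, so no ScaleBase subclass is needed. The sketch below is an assumption-laden variant, not the code from this answer: forward, inverse and demo are illustrative names, the latencies are synthetic, and the clip at 1 - 1e-12 is an arbitrary choice to keep the transform finite.

```python
import numpy as np


def forward(y):
    """Map y -> -log10(1 - y); clip so values at or above 1 stay finite."""
    y = np.asarray(y, dtype=float)
    return -np.log10(1.0 - np.clip(y, None, 1.0 - 1e-12))


def inverse(t):
    """Map t -> 1 - 10**(-t), the inverse of forward() below the clip."""
    return 1.0 - 10.0 ** (-np.asarray(t, dtype=float))


def demo(nines=4):
    """Plot a CDF of synthetic latencies with a 'nines' y-axis."""
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    latencies = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # made-up data
    cdfx = np.sort(latencies)
    cdfy = np.arange(len(cdfx)) / len(cdfx)  # stays strictly below 1

    fig, ax = plt.subplots()
    ax.plot(cdfx, cdfy)
    ax.set_yscale('function', functions=(forward, inverse))
    ticks = [1.0 - 10.0 ** (-k) for k in range(nines + 1)]
    ax.set_ylim(0.0, ticks[-1])
    ax.set_yticks(ticks)
    ax.set_yticklabels([str(t) for t in ticks])
    ax.grid(True)
    plt.show()
```

Calling demo() should draw the curve with ticks at 0, 0.9, 0.99 and so on. Unlike the CloseToOne class, this version sets the ticks and limits by hand after choosing the scale.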

answered Oct 05 '22 12:10 by Lev Levitsky


Ok, this isn't the cleanest code, but I can't see a way around it. Maybe what I'm really asking for isn't a logarithmic CDF, but I'll wait for a statistician to tell me otherwise. Anyway, here is what I came up with:

import math

import matplotlib.pyplot as plt
import numpy

# retrieve event times and latencies from the file
times, latencies = read_in_data_from_file('myfile.csv')
cdfx = numpy.sort(latencies)
cdfy = numpy.linspace(1.0 / len(latencies), 1.0, len(latencies))

# find the logarithmic CDF and ylabels
logcdfy = [-math.log10(1.0 - (float(idx) / len(latencies)))
           for idx in range(len(latencies))]
labels = ['', '90', '99', '99.9', '99.99', '99.999', '99.9999', '99.99999']
# int() keeps the slice index valid where math.ceil returns a float (Python 2)
ymax = int(math.ceil(max(logcdfy)))
labels = labels[0:ymax + 1]

# plot the logarithmic CDF
fig = plt.figure()
axes = fig.add_subplot(1, 1, 1)
axes.scatter(cdfx, logcdfy, s=4, linewidths=0)
axes.set_xlim(min(latencies), max(latencies) * 1.01)
axes.set_ylim(0, ymax)
# pin one tick per decade so each label lands on the intended percentile
axes.set_yticks(list(range(len(labels))))
axes.set_yticklabels(labels)
plt.show()

The messy part is where I change the yticklabels. For n samples, the logcdfy variable holds values between 0 and log10(n); in my example it ranged from 0 to 6. In this code, I swap the numeric labels for percentiles. The plot function could also be used, but I like the way the scatter function shows the outliers on the tail. I also chose not to put the x-axis on a log scale because my particular data already falls on a fairly straight line without it.
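For what it's worth, the same log-CDF values and percentile labels can be computed without the list comprehension and the hard-coded label list. This is only a sketch: log_cdf and percentile_labels are hypothetical helper names, and it assumes the same CDF convention as above (idx / n, which never reaches 1).

```python
import numpy as np


def log_cdf(latencies):
    """Return sorted latencies and -log10(1 - F), with F = idx / n."""
    x = np.sort(np.asarray(latencies, dtype=float))
    n = len(x)
    f = np.arange(n) / n  # empirical CDF in [0, 1), never exactly 1
    return x, -np.log10(1.0 - f)


def percentile_labels(decades):
    """Labels for ticks 0..decades: '', '90', '99', '99.9', '99.99', ..."""
    labels = ['']
    for k in range(1, decades + 1):
        if k == 1:
            labels.append('90')
        elif k == 2:
            labels.append('99')
        else:
            labels.append('99.' + '9' * (k - 2))
    return labels
```

The labels would then go on ticks placed at 0, 1, 2, ... up to the ceiling of the largest log-CDF value.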

Resulting logarithmic CDF scatter plot

answered Oct 05 '22 10:10 by nic