
How can I plot ca. 20 million points as a scatterplot?

I am trying to create a scatterplot with matplotlib that consists of ca. 20 million data points. Even with alpha set to the lowest value that still leaves any data visible, the result is just a completely black plot.

plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.')

The x-axis is a continuous timeline of about 2 months and the y-axis consists of 150k consecutive integer values.

Is there any way to plot all the points so that their distribution over time is still visible?

Thank you for your help.

FrozenSUSHI asked Sep 18 '13 18:09




1 Answer

There's more than one way to do this. A lot of folks have suggested a heatmap/kernel-density-estimate/2d-histogram. @Bucky suggested using a moving average. In addition, you can fill between a moving min and moving max, and plot the moving mean over the top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they're not, it's simple enough to sort y by x before "chunking" in the chunkplot function.
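If the time values are not already in order, the sorting step could look roughly like this. It is only a sketch, assuming x and y are equal-length NumPy arrays; the helper name sort_by_x is hypothetical and is not part of the answer's code.

```python
import numpy as np

def sort_by_x(x, y):
    """Return x and y reordered so that x is non-decreasing (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y)
    # A stable sort keeps points with identical x values in their input order
    order = np.argsort(x, kind="stable")
    return x[order], y[order]

# Small illustration with made-up values
x_unsorted = np.array([3.0, 1.0, 2.0, 1.0])
y_unsorted = np.array([30, 10, 20, 11])
x_sorted, y_sorted = sort_by_x(x_unsorted, y_unsorted)
```

The sorted arrays can then be passed to chunkplot in place of the originals. Note that fancy indexing makes copies, so for 20 million points this temporarily doubles the memory used by x and y.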

Here are a couple of different ideas. Which is best will depend on what you want to emphasize in the plot. Note that this will be rather slow to run, but that's mostly due to the scatterplot. The other plotting styles are much faster.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
np.random.seed(1977)

def main():
    x, y = generate_data()
    fig, axes = plt.subplots(nrows=3, sharex=True)
    for ax in axes.flat:
        ax.xaxis_date()
    fig.autofmt_xdate()

    axes[0].set_title('Scatterplot of all data')
    axes[0].scatter(x, y, marker='.')

    axes[1].set_title('"Chunk" plot of data')
    chunkplot(x, y, chunksize=1000, ax=axes[1],
              edgecolor='none', alpha=0.5, color='gray')

    axes[2].set_title('Hexbin plot of data')
    axes[2].hexbin(x, y)

    plt.show()

def generate_data():
    # Generate a very noisy but interesting timeseries
    x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
                      dt.timedelta(minutes=10))
    num = x.size
    y = np.random.random(num) - 0.5
    y.cumsum(out=y)
    y += 0.5 * y.max() * np.random.random(num)
    return x, y

def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
    if ax is None:
        ax = plt.gca()
    if line_kwargs is None:
        line_kwargs = {}
    # Wrap the array into a 2D array of chunks, truncating the last chunk if
    # chunksize isn't an even divisor of the total size.
    # (This part won't use _any_ additional memory)
    numchunks = y.size // chunksize
    ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
    xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))

    # Calculate the max, min, and means of chunksize-element chunks...
    max_env = ychunks.max(axis=1)
    min_env = ychunks.min(axis=1)
    ycenters = ychunks.mean(axis=1)
    xcenters = xchunks.mean(axis=1)

    # Now plot the bounds and the mean...
    fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
    line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
    return fill, line

main()

[Figure: three stacked panels showing the scatterplot of all data, the "chunk" plot, and the hexbin plot]
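The 2d-histogram idea mentioned at the top of this answer is not implemented above, so here is a rough sketch of it. It is not the answer's original code: the function names bin_counts and plot_density, the bin counts, and the synthetic data (60 days on x, integers up to 150k on y, far fewer than 20 million points) are all assumptions chosen for illustration. A log color scale is used so that sparse and dense cells remain distinguishable.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def bin_counts(x, y, bins=(1000, 500)):
    """Count points per (x, y) cell; returns the counts and both sets of bin edges."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    return counts, xedges, yedges

def plot_density(x, y, bins=(1000, 500), ax=None):
    """Draw the binned counts as an image with a logarithmic color scale."""
    if ax is None:
        ax = plt.gca()
    counts, xedges, yedges = bin_counts(x, y, bins)
    # Mask empty cells so they stay blank; a log scale cannot represent zero
    masked = np.ma.masked_equal(counts, 0)
    # histogram2d indexes counts as [x_bin, y_bin]; pcolormesh wants [y, x]
    return ax.pcolormesh(xedges, yedges, masked.T, norm=LogNorm())

# Synthetic stand-in data (assumed shape of the problem, not the asker's data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 60, size=200_000)        # time in days
y = rng.integers(0, 150_000, size=200_000)  # integer-valued measurements

counts, xedges, yedges = bin_counts(x, y)
fig, ax = plt.subplots()
mesh = plot_density(x, y, ax=ax)
fig.colorbar(mesh, ax=ax, label="points per cell")
```

Call plt.show() or fig.savefig(...) afterwards to view the result. Because only the binned counts are drawn, the rendering cost depends on the number of bins rather than the number of points, which is the same reason the hexbin panel is fast.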

Joe Kington answered Oct 06 '22 02:10