
The tick label in scatterplot with Pandas is not drawn correctly

Tags: python, pandas

I'm plotting a scatterplot matrix with Pandas, but the tick labels of the first plot are sometimes drawn correctly and sometimes incorrectly. I can't figure out what's wrong!

Here's an example:

enter image description here

Code:

from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas releases
import pylab
import numpy as np
import pandas as pd

def create_scatterplot_matix(X, name):    
    """
    Outputs a scatterplot matrix for a design matrix.

    Parameters:
    -----------
    X:a design matrix where each column is a feature and each row is an observation.
    name: the name of the plot.
    """
    pylab.figure()
    df = pd.DataFrame(X)
    axs = scatter_matrix(df, alpha=0.2, diagonal='kde')

    for ax in axs[:,0]: # the left boundary
        ax.grid(False, axis='both') # pass a boolean; the string 'off' is truthy
        ax.set_yticks([0, .5])

    for ax in axs[-1,:]: # the lower boundary
        ax.grid(False, axis='both')
        ax.set_xticks([0, .5])

    pylab.savefig(name + ".png")

Guys, anyone?!!

Edit (example of X):

X = np.random.randn(1000000, 10)

Jack Twain asked Nov 01 '22 16:11


1 Answer

This is intended behavior. The y-axis tick labels on the left belong to the plots in the 0th column. The plot at row 0, column 0 is a probability density graph, while the plots at row 0, columns 1-3 show the data used to create the graphs on the diagonal.

The example in the Pandas Plotting documentation looks similar.

Demonstration:

from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas releases
import pylab
import numpy as np
import pandas as pd

def create_scatterplot_matix(X, name):    
    pylab.figure()

    df = pd.DataFrame(X)
    axs = scatter_matrix(df, alpha=0.2, diagonal='kde')

    pylab.savefig(name + ".png")

create_scatterplot_matix([[0, 0, 0, 0],
                          [1, 1, 1, 1],
                          [1, 1, 1, 1],
                          [2, 2, 2, 2]], 'test')

In this example code, I've used an extremely simple dataset for demonstration purposes. I've also removed the section of code which sets the y and x ticks.

This is the resulting plot:

enter image description here

Each diagonal cell holds a probability density graph, and each off-diagonal cell holds a scatterplot of the data used to create the graphs on the diagonal. The y-axis of the 0th row shows the y-axis of the probability density graph located in the (0,0) position. The y-axes of the 1st, 2nd, and 3rd rows show the y-axes of the data in the (0,1), (0,2), and (0,3) positions used to create the probability density graphs on the diagonal.

You can see in our example, the following plotted points: [0,0] [1,1] [2,2]. The point at [1,1] is darker because there are more points at this location than at the others.
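To make the "darker point" effect concrete, here is a small arithmetic sketch. It assumes standard alpha compositing, where each marker drawn at opacity alpha covers that fraction of whatever is still visible beneath it. The helper name `stacked_opacity` is made up for illustration and is not part of pandas or matplotlib.

```python
def stacked_opacity(alpha, n_points):
    """Approximate combined opacity of n_points identical markers drawn
    on top of each other, assuming standard alpha compositing."""
    return 1.0 - (1.0 - alpha) ** n_points


# alpha=0.2 is the value passed to scatter_matrix in the code above.
single = stacked_opacity(0.2, 1)  # one point, as at [0,0] and [2,2]
double = stacked_opacity(0.2, 2)  # two points, as at [1,1]

print(round(single, 2), round(double, 2))
```

Under that assumption, the two overlapping points at [1,1] come out at roughly 0.36 opacity versus 0.2 for a single point, which is why that marker looks darker.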

What's happening is that in your dataset all of the values are between 0 and 1, which is why 0.5 sits exactly in the center of each row and column on both axes. However, the data is heavily skewed toward 0, which is why the probability density graphs spike as you get closer to 0. The maximum of the probability density graph in the 0th row looks (by eye) to be about 8-10, so ticks fixed at [0, 0.5] don't fit that axis.
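As a rough check of that claim, the sketch below estimates a density for data that lies in [0, 1] but piles up near 0, and shows that the peak of the density is well above 1. The Beta-distributed sample is an invented stand-in for your data, and the hand-rolled Gaussian kernel estimate with a Scott's-rule bandwidth is a simplified substitute for the SciPy-based estimator that pandas uses for diagonal='kde'.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: every value is in [0, 1], heavily skewed toward 0.
samples = rng.beta(0.5, 8.0, size=2000)

# Scott's rule of thumb for the kernel bandwidth.
bandwidth = samples.std(ddof=1) * samples.size ** (-1 / 5)

# Evaluate a Gaussian kernel density estimate on a grid over [0, 1].
grid = np.linspace(0.0, 1.0, 201)
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (
    samples.size * bandwidth * np.sqrt(2.0 * np.pi)
)

peak = float(density.max())
print("data range:", samples.min(), samples.max())
print("density peak:", peak)
```

The data never leaves [0, 1], yet the density axis needs a much larger range, so the same pair of ticks cannot suit both kinds of plot.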

What I would personally do is edit your left boundary code to something like this:

autoscale = True  # let the (0,0) cell's y-axis autoscale
for ax in axs[:,0]:  # the left boundary
    ax.grid(False, axis='both')  # pass a boolean; the string 'off' is truthy
    if autoscale:
        ax.set_autoscale_on(True)
        autoscale = False
    else:
        ax.set_yticks([0, 0.5])

For our example dataset, using this technique produces a chart like this:

enter image description here
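For reference, here is one way the whole thing might look as a self-contained script on a current pandas/matplotlib install. This is only a sketch: the function name `scatter_matrix_with_fixed_ticks`, the random test data, and the use of `enumerate` instead of a flag are illustrative choices. It uses diagonal="hist" so it runs without SciPy; switch to "kde" to match the plots above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix


def scatter_matrix_with_fixed_ticks(X, ticks=(0, 0.5)):
    """Hypothetical helper: draw the matrix, then fix the ticks everywhere
    except the top-left cell, whose y-axis is left to autoscale."""
    df = pd.DataFrame(X)
    # diagonal="hist" avoids the SciPy dependency that diagonal="kde" needs.
    axs = scatter_matrix(df, alpha=0.2, diagonal="hist")

    for row, ax in enumerate(axs[:, 0]):  # the left boundary
        ax.grid(False, axis="both")
        if row > 0:  # skip the (0, 0) cell so its y-axis keeps autoscaling
            ax.set_yticks(list(ticks))

    for ax in axs[-1, :]:  # the lower boundary
        ax.grid(False, axis="both")
        ax.set_xticks(list(ticks))

    return axs


# Made-up data in [0, 1), just to exercise the function.
rng = np.random.default_rng(0)
axs = scatter_matrix_with_fixed_ticks(rng.random((200, 3)))
```

To write the result to disk, call `axs[0, 0].figure.savefig("test.png")` on the returned array of axes.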

rwflash answered Nov 20 '22 12:11