I'm plotting a scatterplot matrix with Pandas, but the tick labels of the first plot are sometimes drawn correctly and sometimes incorrectly. I can't figure out what's wrong.
Here's an example:
Code:
from pandas.plotting import scatter_matrix  # formerly pandas.tools.plotting
import pylab
import numpy as np
import pandas as pd

def create_scatterplot_matix(X, name):
    """
    Outputs a scatterplot matrix for a design matrix.

    Parameters:
    -----------
    X: a design matrix where each column is a feature and each row is an observation.
    name: the name of the plot.
    """
    pylab.figure()
    df = pd.DataFrame(X)
    axs = scatter_matrix(df, alpha=0.2, diagonal='kde')
    for ax in axs[:, 0]:  # the left boundary
        ax.grid(False, axis='both')  # a boolean works across matplotlib versions
        ax.set_yticks([0, .5])
    for ax in axs[-1, :]:  # the lower boundary
        ax.grid(False, axis='both')
        ax.set_xticks([0, .5])
    pylab.savefig(name + ".png")
Edit (example of X):
X = np.random.randn(1000000, 10)
This is intended behavior. The y-axis labels of each row show the y-axis of the plot in the 0th column. The plot in the 0th row, 0th column is a probability density graph. The plots in the 0th row, 1st-3rd columns are scatter plots of the data used to create the graphs on the diagonal.
The example in the Pandas Plotting documentation looks similar.
Demonstration:
from pandas.plotting import scatter_matrix  # formerly pandas.tools.plotting
import pylab
import numpy as np
import pandas as pd

def create_scatterplot_matix(X, name):
    pylab.figure()
    df = pd.DataFrame(X)
    axs = scatter_matrix(df, alpha=0.2, diagonal='kde')
    pylab.savefig(name + ".png")

create_scatterplot_matix([[0, 0, 0, 0],
                          [1, 1, 1, 1],
                          [1, 1, 1, 1],
                          [2, 2, 2, 2]], 'test')
In this example code, I've used an extremely simple dataset for demonstration purposes. I've also removed the section of code which sets the y and x ticks.
This is the resulting plot:
Each plot on the diagonal is a probability density graph. Each off-diagonal plot is a scatter plot of the data used to create the graphs on the diagonal. The y-axis of the 0th row shows the y-axis of the probability density graph located in the 0,0th position. The y-axes of the 1st, 2nd, and 3rd rows show the y-axes of the scatter plots in the 0th column of those rows (the 0,1, 0,2, and 0,3 positions), which plot the data used to create the probability density graphs on the diagonal.
You can see in our example, the following plotted points: [0,0] [1,1] [2,2]. The point at [1,1] is darker because there are more points at this location than at the others.
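The layout described above can be checked directly on the array of axes that scatter_matrix returns. This is only a rough sketch: the random data and the variable names are made up for illustration, and it assumes a reasonably recent pandas and matplotlib with SciPy installed (the 'kde' diagonal needs it).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so no display is needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 4)))  # hypothetical data in [0, 1)
axs = scatter_matrix(df, alpha=0.2, diagonal="kde")

n = axs.shape[0]
for i in range(n):
    for j in range(n):
        ax = axs[i, j]
        # Density curves are drawn as lines; scatter plots are stored as collections.
        kind = "density curve" if len(ax.lines) > 0 else "scatter"
        print(i, j, kind)
```

Running it should report a density curve for every position where the row and column index match, and a scatter plot everywhere else.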
What's happening is that, in your dataset, all of the values are between 0 and 1, which is why 0.5 shows on both axes perfectly in the centers of the rows/columns. However, the data is heavily skewed toward the value 0, which is why the probability density graphs spike up the closer you get to 0. The max value of the probability density graph in the 0th row looks like it is (eyeball test) about 8-10.
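It may seem odd that a density graph can reach 8-10 when the data never leaves the range 0 to 1. The area under a density curve is 1, so when most of the data sits in a narrow band, the curve has to be tall there. Here is a small NumPy-only sketch of that effect. The beta-distributed samples, the fixed bandwidth of 0.01, and the helper gaussian_kde_1d are all assumptions made up for this illustration, not what pandas uses internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
# Hypothetical data: values in [0, 1], heavily skewed toward 0.
samples = rng.beta(0.5, 8.0, size=2000)
grid = np.linspace(-0.2, 1.2, 1401)
density = gaussian_kde_1d(samples, grid, bandwidth=0.01)

peak = density.max()
# Riemann-sum approximation of the area under the curve.
area = density.sum() * (grid[1] - grid[0])
print("data range:", samples.min(), samples.max())
print("density peak:", peak)
print("area under curve:", area)
```

The peak comes out well above 1 while the area stays close to 1, so ticks at [0, 0.5] only cover the very bottom of a curve like this.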
What I would personally do is edit your left boundary code to something like this:
autoscale = True  # We want the 0,0th item's y-axis to autoscale
for ax in axs[:, 0]:  # the left boundary
    ax.grid(False, axis='both')
    if autoscale:
        ax.set_autoscale_on(True)
        autoscale = False
    else:
        ax.set_yticks([0, 0.5])
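For reference, here is a self-contained sketch that folds the same idea into one function. The function name scatter_matrix_with_ticks and the random test data are made up for this example, and it assumes a recent pandas (where scatter_matrix lives in pandas.plotting) with SciPy installed for the 'kde' diagonal.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so no display is needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

def scatter_matrix_with_ticks(X, ticks=(0, 0.5)):
    """Draw a scatterplot matrix; fix ticks everywhere except the top-left y-axis."""
    df = pd.DataFrame(X)
    axs = scatter_matrix(df, alpha=0.2, diagonal="kde")
    for i, ax in enumerate(axs[:, 0]):  # the left boundary
        ax.grid(False)
        if i == 0:
            # The top-left plot is a density curve, so leave its y-axis alone.
            ax.set_autoscale_on(True)
        else:
            ax.set_yticks(list(ticks))
    for ax in axs[-1, :]:  # the lower boundary
        ax.grid(False)
        ax.set_xticks(list(ticks))
    return axs

rng = np.random.default_rng(0)
axs = scatter_matrix_with_ticks(rng.random((200, 3)))  # hypothetical data in [0, 1)
```

To write the image, something like axs[0, 0].get_figure().savefig("test.png") should work, since all the axes belong to the same figure.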
For our example dataset, using this technique produces a chart like this: