
Pandas scatter matrix - what do the histograms mean?

I don't understand the histograms in a Pandas scatter matrix.

I plotted a scatter matrix of the iris dataset.

from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')

It looks like this.

[Image: scatter matrix of the iris dataset, with histograms on the diagonal]

The first histogram didn't look like it had the right frequencies, so I binned the column myself.

df['sep_len_bin'] = pd.cut(df['sepal length (cm)'], 10)
print(df.sep_len_bin.value_counts().sort_index())

I got these results. These frequencies don't appear to match the first histogram in the scatter matrix.

(4.296, 4.66]     9
(4.66, 5.02]     23
(5.02, 5.38]     14
(5.38, 5.74]     27
(5.74, 6.1]      22
(6.1, 6.46]      20
(6.46, 6.82]     18
(6.82, 7.18]      6
(7.18, 7.54]      5
(7.54, 7.9]       6
Name: sep_len_bin, dtype: int64

Then I plotted a histogram by itself.

plt.hist(df['sepal length (cm)'], bins=10)

The figure matches the bins I made. The distribution has the same shape as the first histogram in the scatter matrix, but why does the scatter matrix histogram have different frequencies?

[Image: standalone histogram of sepal length (cm) with 10 bins]

Jai Jeffryes asked Sep 03 '25 14:09


1 Answer

Each off-diagonal scatterplot takes its units and tick marks from the ranges of the two variables being compared, whereas the diagonal subplots are histograms of a single variable. Notice that every y-axis is labeled in cm, which matches the data. That includes the first histogram: its y tick labels show sepal length in cm, while the bar heights are frequencies, so the labels you are reading do not describe the bars.

So the frequencies for the histograms aren't shown, I guess because it would be unclear where those tick marks should appear, but I agree this has the potential to be confusing.
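One way to check this is to read the bar heights directly from the diagonal axes and compare them with an independent count. The sketch below is only an illustration under some assumptions: it uses a small synthetic frame named `demo` (not the iris data) so that it runs without scikit-learn, and it relies on the histogram bars being stored as rectangles in `axes[0, 0].patches`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris columns (an assumption, not the original data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(5.8, 0.8, 150),
    "b": rng.normal(3.0, 0.4, 150),
})

# scatter_matrix returns a 2-D array of matplotlib axes
axes = pd.plotting.scatter_matrix(demo, figsize=(6, 6))

# Heights of the bars actually drawn in the top-left (diagonal) histogram
bar_heights = np.array([p.get_height() for p in axes[0, 0].patches])

# Independent count with the same number of equal-width bins
counts, _ = np.histogram(demo["a"], bins=10)

print(bar_heights)
print(counts)
```

If the two printed arrays agree, the histogram itself is correct and only the tick labels are misleading. The same comparison should work on the iris `df` from the question by using `df['sepal length (cm)']` in place of `demo["a"]`.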

As an aside, if you plot a KDE on the diagonal instead, its density tick marks aren't shown either, but the overall shape is correct, just like the histogram.

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D', diagonal='kde')

[Image: scatter matrix of the iris dataset, with KDE plots on the diagonal]
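If you keep the default histogram diagonal, the binning can also be set explicitly through the `hist_kwds` argument, which is passed on to the underlying histogram call. The following is a minimal sketch on a synthetic frame named `demo` (an assumption made to keep the example self-contained); the bin count of 5 is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic data standing in for real measurements (an assumption)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x": rng.normal(0.0, 1.0, 200),
    "y": rng.normal(2.0, 0.5, 200),
})

n_bins = 5  # arbitrary choice for illustration

# hist_kwds is forwarded to the histogram drawn on each diagonal subplot
axes = pd.plotting.scatter_matrix(demo, hist_kwds={"bins": n_bins}, figsize=(6, 6))

# Count the bars drawn on each diagonal subplot
bars_per_diagonal = [len(axes[i, i].patches) for i in range(len(demo.columns))]
print(bars_per_diagonal)
```

Fixing the bin count this way makes it easier to compare the diagonal against a standalone `plt.hist` call that uses the same `bins` value.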

Derek O answered Sep 05 '25 15:09