Pandas scatter matrix - what do the histograms mean?

Question

I don't understand the histograms in a Pandas scatter matrix.

I plotted a scatter matrix of the iris dataset.

from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')

It looks like this.

[Image: scatter matrix of the four iris features, colored by species, with histograms on the diagonal]

The first histogram didn't look like it had the right frequencies, so I binned the column myself.

df['sep_len_bin'] = pd.cut(df['sepal length (cm)'], 10)
print(df.sep_len_bin.value_counts().sort_index())

I got these results. These frequencies don't appear to match the first histogram in the scatter matrix.

(4.296, 4.66]     9
(4.66, 5.02]     23
(5.02, 5.38]     14
(5.38, 5.74]     27
(5.74, 6.1]      22
(6.1, 6.46]      20
(6.46, 6.82]     18
(6.82, 7.18]      6
(7.18, 7.54]      5
(7.54, 7.9]       6
Name: sep_len_bin, dtype: int64

Then I plotted a histogram by itself.

plt.hist(df['sepal length (cm)'], bins=10)

The figure matches the bins I made. The distribution has the same shape as the first histogram in the scatter matrix, but why does the scatter matrix histogram have different frequencies?

[Image: standalone histogram of sepal length (cm) with 10 bins]

Derek O · Accepted Answer

Each off-diagonal scatterplot has tick marks based on the ranges of the two variables being compared, whereas the diagonal subplots are histograms of a single variable. Notice that all of the y-axis tick labels in the matrix are in cm, which matches the data, and that includes the labels beside the first histogram. A histogram's bar heights are frequencies, not cm, so those labels do not describe the bars.

So the frequencies for the histograms aren't shown, since I guess it would be unclear where those tick marks should appear, but I agree this has the potential to be confusing.
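One way to check that the bars themselves are fine is to read their heights directly from the diagonal axes instead of from the tick labels. The sketch below assumes the default of 10 bins for the diagonal histograms and that each bar is stored as a rectangle patch on its axes; the variable names (`axes`, `bar_heights`, `counts`) are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# scatter_matrix returns a 2-D array of Axes; the diagonal holds the histograms
axes = pd.plotting.scatter_matrix(df, figsize=[8, 8])

# Each histogram bar is a rectangle patch; its height is the count for that bin
bar_heights = np.array([p.get_height() for p in axes[0, 0].patches])

# Counts computed directly with the same number of bins (assumed default: 10)
counts, edges = np.histogram(df['sepal length (cm)'], bins=10)

print(bar_heights)  # heights of the bars drawn in the top-left histogram
print(counts)       # counts from np.histogram, for comparison

# The labels printed beside that histogram, which belong to the row variable
print([t.get_text() for t in axes[0, 0].get_yticklabels()])
```

If the two arrays agree, the histogram is drawn correctly and only the labels on its y-axis are misleading.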

As an aside, if you plot a KDE on the diagonal instead, its density tick marks aren't shown either, but the overall shape is correct, like the histogram.

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D', diagonal='kde')

[Image: scatter matrix of the iris features with KDE curves on the diagonal]
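If readable frequencies are what you need, one option is to draw the histograms on their own, where each one gets its own count axis. This is a minimal sketch, not the only approach: it uses `DataFrame.hist` for the standalone plots and passes `hist_kwds` to `scatter_matrix` so both figures use the same number of bins. The value 10 is chosen here only to match the binning in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One standalone histogram per numeric column, each with its own count axis
hist_axes = df.hist(bins=10, figsize=[8, 8])

# Same bin count on the diagonal of the scatter matrix, for a like-for-like comparison
matrix_axes = pd.plotting.scatter_matrix(df, figsize=[8, 8], hist_kwds={'bins': 10})

# Total of the bar heights in the first standalone histogram (one count per row of df)
total = sum(p.get_height() for p in hist_axes.flat[0].patches)
print(total, len(df))
```

The standalone figure shows the counts on its y-axes, while the scatter matrix keeps its shared cm labels.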

Tags: python, pandas, matplotlib

Jai Jeffryes
