I don't understand the histograms in a Pandas scatter matrix.
I plotted a scatter matrix of the iris dataset.
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
iris = datasets.load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')
It looks like this.
The first histogram didn't look like it had the right frequencies, so I binned the column myself.
df['sep_len_bin'] = pd.cut(df['sepal length (cm)'], 10)
print(df.sep_len_bin.value_counts().sort_index())
I got these results. These frequencies don't appear to match the first histogram in the scatter matrix.
(4.296, 4.66] 9
(4.66, 5.02] 23
(5.02, 5.38] 14
(5.38, 5.74] 27
(5.74, 6.1] 22
(6.1, 6.46] 20
(6.46, 6.82] 18
(6.82, 7.18] 6
(7.18, 7.54] 5
(7.54, 7.9] 6
Name: sep_len_bin, dtype: int64
Then I plotted a histogram by itself.
plt.hist(df['sepal length (cm)'], bins=10)
The figure matches the bins I made. The distribution has the same shape as the first histogram in the scatter matrix, but why does the scatter matrix histogram have different frequencies?
Each off-diagonal scatterplot has units and tick marks based on the ranges of the two variables being compared, whereas each diagonal subplot is a histogram of a single variable. Notice that the tick labels on all of the y-axes are in cm, which matches the data. That includes the y-axis next to the first histogram: its labels follow the sepal length scale of that row, but the bars themselves are measured in frequency, not cm.
So the frequencies for the histogram aren't shown anywhere, I guess because it would be unclear where those tick marks should appear, but I agree this has the potential to be confusing.
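If you want to confirm that the bars really hold ordinary counts, one option is to read the bar heights back from the diagonal Axes. The sketch below is only an illustration under some assumptions: it uses made-up random data with hypothetical column names (`length`, `width`) rather than the iris measurements, and it relies on matplotlib drawing each histogram bar as a rectangle patch on the Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Hypothetical stand-in data (not the iris measurements): two numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "length": rng.normal(5.8, 0.8, size=150),
    "width": rng.normal(3.0, 0.4, size=150),
})

# scatter_matrix returns a 2-D array of matplotlib Axes.
axes = pd.plotting.scatter_matrix(df, figsize=[6, 6])

# The top-left Axes holds the histogram of the first column; each bar is a
# rectangle patch whose height is the count for that bin.
hist_ax = axes[0, 0]
bar_heights = [patch.get_height() for patch in hist_ax.patches]

# Bin the same column directly for comparison (10 equal-width bins).
counts, edges = np.histogram(df["length"], bins=10)

print("bar heights: ", bar_heights)
print("np.histogram:", counts.tolist())
# The labels printed beside the histogram are on the data scale, not counts.
print("y tick labels:", [t.get_text() for t in hist_ax.get_yticklabels()])
```

The bar heights should add up to the number of rows, so the histogram itself is fine; only the labels drawn beside it are on a different scale.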
As an aside, if you plot a KDE on the diagonal instead, the density tick marks aren't shown either, but the overall shape is correct, just as with the histogram.
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D', diagonal='kde')