Seaborn scatterplot matrix - adding extra points with custom styles

Tags: matplotlib, seaborn

I'm doing a k-means clustering of activities on some open source projects on GitHub and am trying to plot the results together with the cluster centroids using Seaborn Scatterplot Matrix.

I can successfully plot the results of the clustering analysis (example TSV output below):

user_id issue_comments  issues_created  pull_request_review_comments    pull_requests   category
1   0.14936519790888722 2.0100502512562812  0.0 0.60790273556231    Group 0
1882    0.11202389843166542 0.5025125628140703  0.0 0.0 Group 1
2   2.315160567587752   20.603015075376884  0.13297872340425532 1.21580547112462    Group 2
1789    36.8185212845407    82.91457286432161   75.66489361702128   74.46808510638297   Group 3

The problem I'm having is that I'd also like to plot the centroids of the clusters on the matrix plot. Currently my plotting script looks like this:

import seaborn as sns
import pandas as pd
from pylab import savefig
sns.set()

# Use the first column (user_id) as the index so that it is
# not plotted as a variable. DataFrame.from_csv did this by
# default, but it has been removed from pandas.
data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)

grid = sns.pairplot(data, hue="category", diag_kind="kde")
savefig('normalised_clusters.png', dpi=150)

This produces the expected output:

[Image: matrix plot, https://i.stack.imgur.com/Nwqh7.png]

I'd like to be able to mark on each of these plots the centroids of the clusters. I can think of two ways to do this:

1. Create a new 'CENTROID' category and just plot this together with the other points.
2. Manually add extra points to the plots after calling sns.pairplot(data, hue="category", diag_kind="kde").

If (1) is the solution then I'd like to be able to customise the marker (perhaps a star?) to make it more prominent.

If (2), I'm all ears. I'm pretty new to Seaborn and Matplotlib so any assistance would be very welcome :-)

asked Aug 14 '15 11:08

arfon

1 Answer

pairplot isn't going to be all that well suited to this sort of thing, but it's possible to make it work with a few tricks. Here's what I would do.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sns.set_color_codes()

# Make some random iid data
cov = np.eye(3)
ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
                np.random.multivariate_normal([1, 1, 1], cov, 50)])
ds = pd.DataFrame(ds, columns=["x", "y", "z"])

# Fit the k means model and label the observations
km = KMeans(2).fit(ds)
ds["label"] = km.labels_.astype(str)

Now comes the non-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations while identifying the centroids as appropriate using the label column:

centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)
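If the fitted model is not at hand (for instance when only a labelled TSV like the one in the question is available), the centroids could probably be recomputed as per-cluster means, since that is what k-means centroids are once the algorithm has converged. The following is only a sketch: the column names are borrowed from the question, the values are made up, and the "<label> centroid" naming is just the convention used above.

```python
import pandas as pd

# Toy stand-in for the labelled observations (values are made up).
obs = pd.DataFrame({
    "issue_comments": [0.0, 2.0, 10.0, 14.0],
    "issues_created": [1.0, 3.0, 20.0, 30.0],
    "category": ["Group 0", "Group 0", "Group 1", "Group 1"],
})

# Per-cluster means; for a converged k-means fit these coincide
# with the cluster centroids.
centroids = obs.groupby("category").mean().reset_index()

# Relabel the centroid rows so they form their own hue levels.
centroids["category"] = centroids["category"] + " centroid"

# Stack observations and centroids into one frame for PairGrid.
full = pd.concat([obs, centroids], ignore_index=True)
print(full)
```

The resulting frame has the same shape as full_ds above, with the centroid rows appended after the observations.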

Then you just need to use PairGrid, which is a bit more flexible than pairplot and will allow you to map other plot attributes by the hue variable along with the color (at the expense of not being able to draw histograms on the diagonals):

g = sns.PairGrid(full_ds, hue="label",
                 hue_order=["0", "1", "0 centroid", "1 centroid"],
                 palette=["b", "r", "b", "r"],
                 hue_kws={"s": [20, 20, 500, 500],
                          "marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
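The lists above are written out by hand for two clusters. With more clusters (the question has four groups) it may be easier to generate them. A minimal sketch, assuming the "<label> centroid" naming used above; centroid_style is a hypothetical helper, not part of seaborn:

```python
def centroid_style(labels, colors, point_size=20, centroid_size=500):
    """Build PairGrid keyword arguments for observations plus centroids.

    labels: cluster labels as they appear in the hue column.
    colors: one color per cluster, in the same order as labels.
    """
    if len(labels) != len(colors):
        raise ValueError("need exactly one color per cluster label")
    k = len(labels)
    centroid_labels = [f"{label} centroid" for label in labels]
    return {
        # Observations first, then the matching centroids.
        "hue_order": list(labels) + centroid_labels,
        # Each centroid reuses the color of its cluster.
        "palette": list(colors) * 2,
        "hue_kws": {
            "s": [point_size] * k + [centroid_size] * k,
            "marker": ["o"] * k + ["*"] * k,
        },
    }


style = centroid_style(["0", "1"], ["b", "r"])
print(style)
```

The result can then be unpacked into the constructor, e.g. sns.PairGrid(full_ds, hue="label", **style).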

[Image: https://i.stack.imgur.com/mVRB1.png]

An alternative solution would be to plot the observations as normal, then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.

# Plot the data
g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])

# Change the PairGrid dataset and add a new layer
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
g.data = centroids
g.hue_vals = [0, 1]
g.map_offdiag(plt.scatter, s=500, marker="*")
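A variant of the same idea that does not touch the grid's data attributes is to draw on the individual axes with matplotlib directly. This is only a sketch: overlay_centroids is a hypothetical helper, it assumes a full square grid where panel (i, j) shows variable j on the x axis and variable i on the y axis, and the columns of the centroid array must be in the same order as the plotted variables. A bare matplotlib grid stands in for the seaborn one here so that the example is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def overlay_centroids(axes, centers, colors, size=500):
    """Draw centroid stars on every off-diagonal panel of a square grid.

    axes:    2D array of Axes; panel (i, j) has variable j on x, i on y.
    centers: array of shape (n_clusters, n_vars).
    colors:  one color per cluster.
    """
    centers = np.asarray(centers)
    n_vars = centers.shape[1]
    for i in range(n_vars):
        for j in range(n_vars):
            if i == j:
                continue  # leave the diagonal panels untouched
            axes[i, j].scatter(centers[:, j], centers[:, i],
                               s=size, marker="*", c=colors,
                               edgecolors="w", zorder=3)


# Stand-in for the grid of axes that pairplot would create.
fig, axes = plt.subplots(3, 3)
centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
overlay_centroids(axes, centers, colors=["b", "r"])
```

With the objects from above, the call would presumably be overlay_centroids(g.axes, km.cluster_centers_, ["b", "r"]) right after the pairplot call.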

answered Oct 10 '22 17:10

mwaskom
