Seaborn scatterplot matrix - adding extra points with custom styles

I'm doing a k-means clustering of activities on some open source projects on GitHub and am trying to plot the results together with the cluster centroids using Seaborn Scatterplot Matrix.

I can successfully plot the results of the clustering analysis (example tsv output below)

user_id issue_comments  issues_created  pull_request_review_comments    pull_requests   category
1   0.14936519790888722 2.0100502512562812  0.0 0.60790273556231    Group 0
1882    0.11202389843166542 0.5025125628140703  0.0 0.0 Group 1
2   2.315160567587752   20.603015075376884  0.13297872340425532 1.21580547112462    Group 2
1789    36.8185212845407    82.91457286432161   75.66489361702128   74.46808510638297   Group 3

The problem is that I'd also like to plot the centroids of the clusters on the matrix plot. Currently my plotting script looks like this:

import seaborn as sns
import pandas as pd
from pylab import savefig
sns.set()

# Use the first column (user_id) as the index so it is not plotted.
# DataFrame.from_csv was removed in pandas 1.0; read_csv replaces it.
data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)

grid = sns.pairplot(data, hue="category", diag_kind="kde")
savefig('normalised_clusters.png', dpi = 150)

This produces the expected output (image: matrix plot).

I'd like to be able to mark on each of these plots the centroids of the clusters. I can think of two ways to do this:

  1. Create a new 'CENTROID' category and just plot this together with the other points.
  2. Manually add extra points to the plots after calling sns.pairplot(data, hue="category", diag_kind="kde").

If (1) is the solution then I'd like to be able to customise the marker (perhaps a star?) to make it more prominent.

If (2) I'm all ears. I'm pretty new to Seaborn and Matplotlib so any assistance would be very welcome :-)

asked Aug 14 '15 at 11:08 by arfon



1 Answer

pairplot isn't going to be all that well suited to this sort of thing, but it's possible to make it work with a few tricks. Here's what I would do.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sns.set_color_codes()

# Make some random iid data
cov = np.eye(3)
ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
                np.random.multivariate_normal([1, 1, 1], cov, 50)])
ds = pd.DataFrame(ds, columns=["x", "y", "z"])

# Fit the k means model and label the observations
km = KMeans(2).fit(ds)
ds["label"] = km.labels_.astype(str)

Now comes the non-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations while identifying the centroids as appropriate using the label column:

centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)
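The same step can be wrapped in a small helper so it works for any number of clusters. This is only a sketch: `add_centroid_rows` is a hypothetical name, and the toy arrays stand in for the observations and for `km.cluster_centers_`. It assumes row i of the centers array belongs to cluster label i, which is how scikit-learn's KMeans orders them.

```python
import numpy as np
import pandas as pd


def add_centroid_rows(ds, centers, feature_cols, label_col="label"):
    """Append one row per centroid, labelled '<cluster> centroid'."""
    centroids = pd.DataFrame(centers, columns=feature_cols)
    # Row i of `centers` is assumed to be the centroid of cluster i
    centroids[label_col] = [f"{i} centroid" for i in range(len(centroids))]
    return pd.concat([ds, centroids], ignore_index=True)


# Toy stand-ins for the observations and for km.cluster_centers_
obs = pd.DataFrame({"x": [0.0, 0.2, 1.0, 1.2],
                    "y": [0.0, 0.2, 1.0, 1.2],
                    "label": ["0", "0", "1", "1"]})
centers = np.array([[0.1, 0.1], [1.1, 1.1]])

full = add_centroid_rows(obs, centers, ["x", "y"])
print(full["label"].tolist())
# ['0', '0', '1', '1', '0 centroid', '1 centroid']
```

Keeping the centroid labels distinct from the observation labels is what lets the hue variable give them their own marker and size later on.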

Then you just need to use PairGrid, which is a bit more flexible than pairplot and will allow you to map other plot attributes by the hue variable along with the color (at the expense of not being able to draw histograms on the diagonals):

g = sns.PairGrid(full_ds, hue="label",
                 hue_order=["0", "1", "0 centroid", "1 centroid"],
                 palette=["b", "r", "b", "r"],
                 hue_kws={"s": [20, 20, 500, 500],
                          "marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
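With more than two clusters, writing those parallel lists by hand gets error-prone. A minimal sketch of building them programmatically follows; `centroid_hue_settings` is a hypothetical helper, not part of seaborn, and it assumes the labels are "0", "1", ... with matching "<n> centroid" entries as above.

```python
def centroid_hue_settings(k, colors, point_size=20, centroid_size=500):
    """Build hue_order, palette and hue_kws for k clusters plus k centroids."""
    if len(colors) < k:
        raise ValueError("need at least one color per cluster")
    labels = [str(i) for i in range(k)]
    return {
        # Observations first, then centroids, so all lists line up by position
        "hue_order": labels + [f"{lab} centroid" for lab in labels],
        # Each centroid reuses the color of its own cluster
        "palette": list(colors[:k]) * 2,
        "hue_kws": {
            "s": [point_size] * k + [centroid_size] * k,
            "marker": ["o"] * k + ["*"] * k,
        },
    }


settings = centroid_hue_settings(2, ["b", "r"])
print(settings["hue_order"])
# ['0', '1', '0 centroid', '1 centroid']

# Intended use (not run here): g = sns.PairGrid(full_ds, hue="label", **settings)
```

For k=2 and colors ["b", "r"] this reproduces the values passed to PairGrid above.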


An alternate solution would be to plot the observations as normal then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.

# Plot the data
g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])

# Change the PairGrid dataset and add a new layer
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
g.data = centroids
g.hue_vals = [0, 1]
g.map_offdiag(plt.scatter, s=500, marker="*")
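
Because this hack reassigns attributes on the PairGrid object, it may behave differently across seaborn versions. A third option, sketched here as an assumption rather than a tested seaborn recipe, is to leave the grid alone and draw the centroids directly on each off-diagonal matplotlib axes. `overlay_centroids` is a hypothetical helper; the example uses a plain `plt.subplots` grid as a stand-in, and with seaborn you would presumably pass `g.axes`, `g.x_vars` and `g.y_vars` instead.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def overlay_centroids(axes, x_vars, y_vars, centroids, colors, **scatter_kws):
    """Draw one marker per centroid on every off-diagonal panel."""
    kws = {"s": 500, "marker": "*", "edgecolor": "w", "zorder": 3}
    kws.update(scatter_kws)
    for i, y_var in enumerate(y_vars):
        for j, x_var in enumerate(x_vars):
            if x_var == y_var:
                continue  # leave the diagonal (KDE/histogram) panels untouched
            axes[i, j].scatter(centroids[x_var], centroids[y_var],
                               c=colors, **kws)


# Stand-in 3x3 grid and made-up centroid coordinates
cols = ["x", "y", "z"]
fig, axes = plt.subplots(3, 3, squeeze=False)
centroids = pd.DataFrame([[0.0, 0.1, -0.1], [1.0, 0.9, 1.1]], columns=cols)
overlay_centroids(axes, cols, cols, centroids, colors=["b", "r"])
fig.savefig("centroid_overlay.png", dpi=150)
```

The colors list should follow the same cluster order as the palette used for the observations so that each star matches its cluster.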

answered Oct 10 '22 at 17:10 by mwaskom