Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

class labels in Pandas scattermatrix

This question has been asked before, Multiple data in scatter matrix, but didn't receive an answer.

I'd like to make a scatter matrix, something like in the pandas docs, but with differently colored markers for different classes. For example, I'd like some points to appear in green and others in blue depending on the value of one of the columns (or a separate list).

Here's an example using the Iris dataset. The color of the points represents the species of Iris -- Setosa, Versicolor, or Virginica.

iris scattermatrix with class labels

Does pandas (or matplotlib) have a way to make a chart like that?

like image 484
bgschiller Avatar asked Apr 08 '14 17:04

bgschiller


People also ask

What is Scatter_matrix in Python?

A scatterplot matrix is a matrix associated to n numerical arrays (data variables), $X_1,X_2,…,X_n$ , of the same length. The cell (i,j) of such a matrix displays the scatter plot of the variable Xi versus Xj. Here we show the Plotly Express function px.

How do I change the color of a scatter matrix?

Double click on the scatter data to open the Plot Details dialog. To change the symbol color click on the Symbol Color drop-down menu and select a color from the Individual Color option.

What does a scatter matrix show?

A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize bivariate relationships between combinations of variables. Each scatter plot in the matrix visualizes the relationship between a pair of variables, allowing many relationships to be explored in one chart.


2 Answers

Update: This functionality is now in the latest version of Seaborn. Here's an example.

The following was my stopgap measure:

def factor_scatter_matrix(df, factor, palette=None):
    '''Create a scatter matrix of the variables in df, with differently colored
    points depending on the value of df[factor].
    inputs:
        df: pandas.DataFrame containing the columns to be plotted, as well 
            as factor.
        factor: string or pandas.Series. The column indicating which group 
            each row belongs to.
        palette: A list of hex codes, at least as long as the number of groups.
            If omitted, a predefined palette will be used, but it only includes
            9 groups.
    '''
    import matplotlib.colors
    import numpy as np
    from pandas.tools.plotting import scatter_matrix
    from scipy.stats import gaussian_kde

    if isinstance(factor, basestring):
        factor_name = factor #save off the name
        factor = df[factor] #extract column
        df = df.drop(factor_name,axis=1) # remove from df, so it 
        # doesn't get a row and col in the plot.

    classes = list(set(factor))

    if palette is None:
        palette = ['#e41a1c', '#377eb8', '#4eae4b', 
                   '#994fa1', '#ff8101', '#fdfc33', 
                   '#a8572c', '#f482be', '#999999']

    color_map = dict(zip(classes,palette))

    if len(classes) > len(palette):
        raise ValueError('''Too many groups for the number of colors provided.
We only have {} colors in the palette, but you have {}
groups.'''.format(len(palette), len(classes)))

    colors = factor.apply(lambda group: color_map[group])
    axarr = scatter_matrix(df,figsize=(10,10),marker='o',c=colors,diagonal=None)


    for rc in xrange(len(df.columns)):
        for group in classes:
            y = df[factor == group].icol(rc).values
            gkde = gaussian_kde(y)
            ind = np.linspace(y.min(), y.max(), 1000)
            axarr[rc][rc].plot(ind, gkde.evaluate(ind),c=color_map[group])

    return axarr, color_map

As an example, we'll use the same dataset as in the question, available here

>>> import pandas as pd
>>> iris = pd.read_csv('iris.csv')
>>> axarr, color_map = factor_scatter_matrix(iris,'Name')
>>> color_map
{'Iris-setosa': '#377eb8',
 'Iris-versicolor': '#4eae4b',
 'Iris-virginica': '#e41a1c'}

iris_scatter_matrix

Hope this is helpful!

like image 65
bgschiller Avatar answered Oct 15 '22 14:10

bgschiller


You can also call the scattermatrix from pandas as follow :

pd.scatter_matrix(df,color=colors)

with colors being an list of size len(df)containing colors

like image 44
jrjc Avatar answered Oct 15 '22 14:10

jrjc