
Why is Seaborn creating an extra category in my data? [duplicate]

I am trying to plot some simple data with Seaborn 0.9.0 under Python 3.6.5. The data is just two points with a different classification from each other. The classification itself is simply 1 or 2. However when I plot it with Seaborn, the legend shows three types: 0, 1 and 2.

import numpy
import seaborn
import pandas
from matplotlib import pyplot

X = numpy.array([
    [-1, -1, 1],
    [1, 1, 2]
])

data = pandas.DataFrame(X, columns=('x','y','type'))

seaborn.scatterplot(data=data, x='x', y='y', hue='type')

pyplot.show()

The resulting plot shows:

Scatterplot with types 0, 1 and 2

I have also tried this without Pandas, just using e.g. x=X[:,0], y=X[:,1], hue=X[:,2], but the result is the same.

The Seaborn docs say this about the hue argument:

Can be either categorical or numeric, although color mapping will behave differently in latter case.

But they do not clarify what "categorical" means, or what the behaviour is, or how it is different. I've also read the categorical data plotting tutorial, but haven't found an answer.

Using strings like '1' and '2' in the data just results in an error:

AttributeError: 'str' object has no attribute 'view'

Why is there an extra "type" of 0 in the legend? And, for later, how can I have more meaningful category labels?


Reading the categorical data plotting tutorial some more, I found this:

If your data have a pandas Categorical datatype, then the default order of the categories can be set there. If the variable passed to the categorical axis looks numerical, the levels will be sorted. But the data are still treated as categorical and drawn at ordinal positions on the categorical axes (specifically, at 0, 1, …) even when numbers are used to label them:

This half-explains what's happening here (not why there's an extra 0 category), but even using Pandas categorical type doesn't help. Adding

data['type'] = data['type'].astype('category')

...converts this data to the categorical type, but Seaborn still gives an error:

TypeError: data type not understood
asked Oct 19 '18 at 01:10 by detly



1 Answer

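You have indeed run into "numeric" color mapping here. With a numeric hue, seaborn does not build the legend from the values actually present in the data; it picks a number of evenly spaced values (meaningful to itself) and creates the legend from those. That sample will contain at least 3 entries, which is why a 0 shows up next to 1 and 2.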

This may become more obvious if you replace the number 2 in the array with something large, e.g. 900.


The solution here is indeed to activate the "categorical" mapping. The legend argument of scatterplot can take three values:

legend : “brief”, “full”, or False, optional
How to draw the legend. If “brief”, numeric hue and size variables will be represented with a sample of evenly spaced values. If “full”, every group will get an entry in the legend. If False, no legend data is added and no legend is drawn.

So kind of unintuitively (at least in this case) you can set

legend="full"

to get a legend entry for every unique value in the hue column (and hence one less than using "brief").

seaborn.scatterplot(data=data, x='x', y='y', hue='type', legend="full")
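To check this without looking at the figure, you can read the legend entries back from the axes. This is only a minimal sketch: the Agg backend and the explicit `pyplot.subplots()` call are my additions so that it runs without a display, and the exact label text may vary between seaborn versions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed

import pandas
import seaborn
from matplotlib import pyplot

# Same two points as in the question, with numeric types 1 and 2.
data = pandas.DataFrame({'x': [-1, 1], 'y': [-1, 1], 'type': [1, 2]})

fig, ax = pyplot.subplots()
seaborn.scatterplot(data=data, x='x', y='y', hue='type', legend="full", ax=ax)

# Text of every legend entry. Some seaborn versions also list the hue
# column name ("type") as an entry of its own, so it is filtered out here.
labels = [text.get_text() for text in ax.get_legend().get_texts()]
entries = [label for label in labels if label != "type"]
print(entries)
```

With legend="full" there should be one entry per unique value in the hue column and no extra 0.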


Note that using strings as categories will work, but those strings must not be convertible to numbers.

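import seaborn
import pandas
from matplotlib import pyplot

# Build the frame column by column so that x and y stay numeric;
# a numpy array mixing numbers and strings would turn every value into a string.
data = pandas.DataFrame({
    'x': [-1, 1],
    'y': [-1, 1],
    'type': ["A", "B"],
})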

seaborn.scatterplot(data=data, x='x', y='y', hue='type', legend="brief")

pyplot.show()
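For the second part of the question (more meaningful category labels), one possible approach is to map the numeric codes to descriptive strings before plotting. This is a sketch under assumptions: the label names in `type_names` are hypothetical and chosen only for illustration, and the Agg backend is set just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed

import pandas
import seaborn
from matplotlib import pyplot

data = pandas.DataFrame({'x': [-1, 1], 'y': [-1, 1], 'type': [1, 2]})

# Hypothetical names for the two types; replace them with your own.
type_names = {1: "first kind", 2: "second kind"}

# Series.map swaps each numeric code for its string label, so seaborn
# sees a non-numeric hue column and uses the categorical mapping.
data['type'] = data['type'].map(type_names)

fig, ax = pyplot.subplots()
seaborn.scatterplot(data=data, x='x', y='y', hue='type', ax=ax)

labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(labels)
```

Because these strings cannot be read as numbers, the legend should list exactly the labels you supplied, whichever legend setting is used.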


answered Oct 21 '22 at 03:10 by ImportanceOfBeingErnest