 

Is seaborn's confidence interval computed correctly?

First, I must admit that my statistics knowledge is rusty at best: even when it was shiny and new, statistics was never a discipline I particularly liked, which means I've always had a hard time making sense of it.

Nevertheless, I took a look at how the barplot graphs calculate their error bars, and was surprised to find a "confidence interval" (CI) used instead of the (more common) standard deviation. Researching CIs further led me to this Wikipedia article, which seems to say that, basically, a CI is computed as:

lower = mean - 1.96 * stdev / sqrt(n)
upper = mean + 1.96 * stdev / sqrt(n)

Or, in Python:

import numpy as np

def ci_wp(a):
    """Calculate a 95% confidence interval using Wikipedia's formula."""
    m = np.mean(a)
    s = 1.96 * np.std(a) / np.sqrt(len(a))
    return m - s, m + s

But what we find in seaborn/utils.py is:

def ci(a, which=95, axis=None):
    """Return a percentile range from an array of values."""
    p = 50 - which / 2, 50 + which / 2
    return percentiles(a, p, axis)
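
The percentiles helper is internal to seaborn, but assuming it behaves like NumPy's percentile function (the output below is consistent with that assumption), the same limits can be reproduced directly:

import numpy as np

a = np.arange(100)
# 95% range as plain percentiles of the raw data: the 2.5th and 97.5th
# percentiles, i.e. a spread of the data, not an interval around the mean.
print(np.percentile(a, [2.5, 97.5]))  # [ 2.475 96.525]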

Now maybe I'm missing something, but this seems like a completely different calculation from the one described by Wikipedia. Can anyone explain this discrepancy?

To give another example (from the comments), why do we get such different results between:

 >>> sb.utils.ci(np.arange(100))
 array([ 2.475, 96.525])

 >>> ci_wp(np.arange(100))
 (43.842250270646467, 55.157749729353533)

And to compare with other statistical tools:

import numpy as np
import scipy as sp
import scipy.stats  # makes sp.stats available

def ci_std(a):
    """Calculate an interval of one standard deviation around the mean."""
    m = np.mean(a)
    s = np.std(a)
    return m - s, m + s

def ci_sem(a):
    """Calculate an interval of one standard error of the mean around the mean."""
    m = np.mean(a)
    s = sp.stats.sem(a)
    return m - s, m + s

Which gives us:

>>> ci_sem(np.arange(100))
(46.598850802411796, 52.401149197588204)

>>> ci_std(np.arange(100))
(20.633929952277882, 78.366070047722118)
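
For completeness, the textbook normal-approximation CI around the mean can also be obtained from scipy directly; this is just a cross-check of the ci_wp formula above, not anything seaborn does:

import numpy as np
from scipy import stats

a = np.arange(100)
# 95% CI around the mean: mean +/- ~1.96 * SEM (normal approximation).
low, high = stats.norm.interval(0.95, loc=np.mean(a), scale=stats.sem(a))
print(low, high)  # close to ci_wp(a); the small difference comes from ddof in the std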

Or with a random sample:

import numpy as np
import seaborn as sb

rng = np.random.RandomState(10)
a = rng.normal(size=100)
print(sb.utils.ci(a))
print(ci_wp(a))
print(ci_sem(a))
print(ci_std(a))

... which yields:

[-1.9667006   2.19502303]
(-0.1101230745774124, 0.26895640045116026)
(-0.017774461397903049, 0.17660778727165088)
(-0.88762281417683186, 1.0464561400505796)

Why are Seaborn's numbers so radically different from the other results?

asked Sep 08 '17 by anarcat


1 Answer

Your calculation using the Wikipedia formula is completely right. Seaborn simply uses a different method: bootstrapping (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). It is well described by Dragicevic [1]:

[It] consists of generating many alternative datasets from the experimental data by randomly drawing observations with replacement. The variability across these datasets is assumed to approximate sampling error and is used to compute so-called bootstrap confidence intervals. [...] It is very versatile and works for many kinds of distributions.
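
To make that concrete, here is a minimal hand-rolled sketch of the technique, not Seaborn's actual implementation (the helper name boot_ci_sketch, the seed, and the 10000 resamples are just illustrative choices): resample the data with replacement many times, compute the mean of each resample, and take percentile limits of those means.

import numpy as np

def boot_ci_sketch(a, which=95, n_boot=10000, seed=0):
    """Rough bootstrap CI of the mean: resample with replacement, take the
    mean of each resample, then read off the percentile limits."""
    rng = np.random.RandomState(seed)
    a = np.asarray(a)
    boot_means = [rng.choice(a, size=len(a), replace=True).mean()
                  for _ in range(n_boot)]
    p = 50 - which / 2, 50 + which / 2
    return np.percentile(boot_means, p)

print(boot_ci_sketch(np.arange(100)))  # roughly [44, 55], close to ci_wp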

In Seaborn's source code, a barplot uses estimate_statistic, which bootstraps the data and then computes the confidence interval on it:

>>> sb.utils.ci(sb.algorithms.bootstrap(np.arange(100)))
array([43.91, 55.21025])

The result is consistent with your calculation.
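
The plotting functions expose this behaviour as well. In the seaborn releases of that era, barplot takes n_boot to control the number of bootstrap resamples, ci="sd" to fall back to a plain standard deviation error bar, and ci=None to disable error bars (recent releases fold these into the errorbar parameter instead), for example:

import numpy as np
import seaborn as sb

rng = np.random.RandomState(10)
data = rng.normal(size=(20, 3))
# ci/n_boot as in the 0.8-era API; newer seaborn uses errorbar=("ci", 95).
sb.barplot(data=data, ci=95, n_boot=1000)  # bootstrapped 95% CI around the mean
sb.barplot(data=data, ci="sd")             # standard deviation instead of a CI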

[1] Dragicevic, P. (2016). Fair statistical communication in HCI. In Modern Statistical Methods for HCI (pp. 291-330). Springer, Cham.

answered Oct 05 '22 by NormandErwan