
MAD results differ in pandas, scipy, and numpy

I want to compute the MAD (median absolute deviation) which is defined by

MAD = median(|x_i - mean(x)|)

for a list of numbers x

x = list(range(0, 10)) + [1000]

However, the results differ significantly between SciPy, pandas, and a hand-made NumPy implementation:

from scipy import stats
import pandas as pd
import numpy as np

print(stats.median_absolute_deviation(x, scale=1)) # prints 3.0

print(pd.Series(x).mad()) # prints 164.54

print(np.median(np.absolute(x - np.mean(x)))) # prints 91.0

What is wrong?

Michael Dorner asked Feb 06 '20 10:02



1 Answer

The median absolute deviation is defined as:

MAD = median(|x_i - median(x)|)

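Note that the deviations are taken from the median, not from the mean as in your formula. The method mad in pandas returns the mean absolute deviation, mean(|x_i - mean(x)|), instead, which is why it prints 164.54. Your NumPy expression takes the median of the deviations from the mean, a third quantity, which is why it prints 91.0. Only the SciPy call computes the median absolute deviation.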
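To make the difference concrete, here is a minimal sketch that computes all three quantities with plain NumPy on the data from the question. The variable names (median_dev_from_median and so on) are only illustrative and are not part of any library.

```python
import numpy as np

# Data from the question: 0..9 plus one large outlier
x = np.array(list(range(0, 10)) + [1000], dtype=float)

# Median absolute deviation: median of deviations from the median
median_dev_from_median = np.median(np.abs(x - np.median(x)))

# Mean absolute deviation: mean of deviations from the mean
mean_dev_from_mean = np.mean(np.abs(x - np.mean(x)))

# The hand-made expression from the question: median of deviations from the mean
median_dev_from_mean = np.median(np.abs(x - np.mean(x)))

print(median_dev_from_median)        # 3.0
print(round(mean_dev_from_mean, 2))  # 164.55
print(median_dev_from_mean)          # 91.0
```

The outlier pulls the mean up to 95, so anything measured from the mean is dominated by it, while the version based on the median stays at 3.0.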

Test:

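from scipy import stats
import pandas as pd
import numpy as np

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

# SciPy: deviations are taken from the median
stats.median_absolute_deviation(x, scale=1)
# 3.0

# NumPy: same definition written out by hand
np.median(np.absolute(x - np.median(x)))
# 3.0

# pandas: same definition using Series methods
s = pd.Series(x)
(s - s.median()).abs().median()
# 3.0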
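A note on scale=1 in the SciPy call: as far as I know, stats.median_absolute_deviation multiplies the result by 1.4826 by default so that it estimates the standard deviation of normally distributed data, so the factor has to be switched off to get the raw value. The sketch below mimics that with a hypothetical helper scaled_mad (not a SciPy function), assuming the scale is simply multiplied onto the raw MAD.

```python
import numpy as np

def scaled_mad(values, scale=1.0):
    """Median absolute deviation around the median, multiplied by `scale`.

    Hypothetical helper for illustration; not part of SciPy or pandas.
    """
    arr = np.asarray(values, dtype=float)
    return scale * np.median(np.abs(arr - np.median(arr)))

x = list(range(0, 10)) + [1000]

raw = scaled_mad(x)                            # raw MAD, no scaling
normal_consistent = scaled_mad(x, scale=1.4826)  # scaled for normal data

print(raw)                # 3.0
print(normal_consistent)  # approximately 4.4478
```

If a library MAD looks about 1.48 times larger than expected, a default scale factor like this is a likely cause.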
Mykola Zotko answered Oct 10 '22 21:10