I want to compute the MAD (median absolute deviation), which is defined by
MAD = median(|x_i - mean(x)|)
for a list of numbers x
x = list(range(0, 10)) + [1000]
However, the results differ significantly using SciPy, pandas, and a hand-made NumPy implementation:
from scipy import stats
import pandas as pd
import numpy as np
print(stats.median_absolute_deviation(x, scale=1)) # prints 3.0
print(pd.Series(x).mad()) # prints 164.54
print(np.median(np.absolute(x - np.mean(x)))) # prints 91.0
What is wrong?
The median absolute deviation is defined as:
median(|x_i - median(x)|)
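The method mad in pandas returns the mean absolute deviation instead, i.e. mean(|x_i - mean(x)|), which is why it gives 164.54. The hand-made version takes the median of the deviations from the mean rather than from the median, which is why it gives 91.0.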
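To make the three formulas easier to compare, here is a minimal NumPy-only sketch that computes each of them on the list from the question. The variable names (median_ad, mean_ad, mixed) are just illustrative and do not come from any library.

```python
import numpy as np

x = np.array(list(range(0, 10)) + [1000], dtype=float)

# Median absolute deviation: median of |x_i - median(x)|
median_ad = np.median(np.abs(x - np.median(x)))

# Mean absolute deviation: mean of |x_i - mean(x)|
mean_ad = np.mean(np.abs(x - np.mean(x)))

# Mixed formula from the question: median of |x_i - mean(x)|
mixed = np.median(np.abs(x - np.mean(x)))

print(median_ad)  # 3.0
print(mean_ad)    # about 164.545
print(mixed)      # 91.0
```

The outlier 1000 pulls the mean up to 95, so any formula that centers on the mean is dominated by that single value, while the median-based version is barely affected.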
Test:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
stats.median_absolute_deviation(x, scale=1)
# 3.0
np.median(np.absolute(x - np.median(x)))
# 3.0
x = pd.Series(x)
(x - x.median()).abs().median()
# 3.0
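One caveat, as far as I know: stats.median_absolute_deviation and Series.mad have both been removed in recent SciPy and pandas releases, so the calls above may fail on a current installation. A small NumPy-only helper avoids depending on either. This is only a sketch; the function name median_abs_dev and its scale parameter are hypothetical, not part of any library.

```python
import numpy as np

def median_abs_dev(values, scale=1.0):
    """Return scale * median(|x_i - median(x)|) for a 1-D sequence."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        raise ValueError("values must not be empty")
    return scale * np.median(np.abs(arr - np.median(arr)))

x = list(range(0, 10)) + [1000]
print(median_abs_dev(x))                # 3.0
print(median_abs_dev(x, scale=1.4826))  # roughly 4.45
```

With scale=1.0 the helper returns the raw median absolute deviation. The factor 1.4826 is the usual constant that makes the result comparable to a standard deviation for normally distributed data, which is why the SciPy call above passes scale=1 to get the unscaled value.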