Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python vs MATLAB performance on algorithm

I have a performance question about two bits of code. One is implemented in python and one in MATLAB. The code calculates the sample entropy of a time series (which sounds complicated but is basically a bunch of for loops).

I am running both implementations on relatively large time series (~95k+ samples) depending on the time series. The MATLAB implementation finishes the calculation in ~45 sec to 1 min. The python one basically never finishes. I threw tqdm over the python for loops and the upper loop was only moving at about ~1.85s/it which gives 50+ hours as a estimated completion time (I've let it run for 15+ mins and the iteration count was pretty consistent).

Example inputs and runtimes:

MATLAB (~ 52 sec):

a = rand(1, 95000)
sampenc(a, 4, 0.1 * std(a))

Python (currently 5 mins in with 49 hours estimated):

import numpy as np
a = np.random.rand(1, 95000)[0]
sample_entropy(a, 4, 0.1 * np.std(a))

Python Implementation:

# https://github.com/nikdon/pyEntropy
def sample_entropy(time_series, sample_length, tolerance=None):
    """Calculate and return Sample Entropy of the given time series.
    Distance between two vectors defined as Euclidean distance and can
    be changed in future releases
    Args:
        time_series: Vector or string of the sample data
        sample_length: Number of sequential points of the time series
        tolerance: Tolerance (default = 0.1...0.2 * std(time_series))
    Returns:
        Vector containing Sample Entropy (float)
    References:
        [1] http://en.wikipedia.org/wiki/Sample_Entropy
        [2] http://physionet.incor.usp.br/physiotools/sampen/
        [3] Madalena Costa, Ary Goldberger, CK Peng. Multiscale entropy analysis
            of biological signals
    """
    if tolerance is None:
        tolerance = 0.1 * np.std(time_series)

    n = len(time_series)
    prev = np.zeros(n)
    curr = np.zeros(n)
    A = np.zeros((sample_length, 1))  # number of matches for m = [1,...,template_length - 1]
    B = np.zeros((sample_length, 1))  # number of matches for m = [1,...,template_length]

    for i in range(n - 1):
        nj = n - i - 1
        ts1 = time_series[i]
        for jj in range(nj):
            j = jj + i + 1
            if abs(time_series[j] - ts1) < tolerance:  # distance between two vectors
                curr[jj] = prev[jj] + 1
                temp_ts_length = min(sample_length, curr[jj])
                for m in range(int(temp_ts_length)):
                    A[m] += 1
                    if j < n - 1:
                        B[m] += 1
            else:
                curr[jj] = 0
        for j in range(nj):
            prev[j] = curr[j]

    N = n * (n - 1) / 2
    B = np.vstack(([N], B[:sample_length - 1]))
    similarity_ratio = A / B
    se = - np.log(similarity_ratio)
    se = np.reshape(se, -1)
    return se

MATLAB Implementation:

function [e,A,B]=sampenc(y,M,r);
%function [e,A,B]=sampenc(y,M,r);
%
%Input
%
%y input data
%M maximum template length
%r matching tolerance
%
%Output
%
%e sample entropy estimates for m=0,1,...,M-1
%A number of matches for m=1,...,M
%B number of matches for m=0,...,M-1 excluding last point

n=length(y);
lastrun=zeros(1,n);
run=zeros(1,n);
A=zeros(M,1);
B=zeros(M,1);
p=zeros(M,1);
e=zeros(M,1);

for i=1:(n-1)
   nj=n-i;
   y1=y(i);
   for jj=1:nj
      j=jj+i;      
      if abs(y(j)-y1)<r
         run(jj)=lastrun(jj)+1;
         M1=min(M,run(jj));
         for m=1:M1           
            A(m)=A(m)+1;
            if j<n
               B(m)=B(m)+1;
            end            
         end
      else
         run(jj)=0;
      end      
   end
   for j=1:nj
      lastrun(j)=run(j);
   end
end
N=n*(n-1)/2;
B=[N;B(1:(M-1))];
p=A./B;
e=-log(p);

I've also tried a few other python implementations and all of them have the same slow result: vectorized-sample-entropy

sampen

sampen2.py

Wikipedia sample entropy implementation

I don't think computer issue as it runs relativity fast in MATLAB.

As far as I can tell, implementation-wise both sets of code are the same. I have no idea why the python implementations are so slow. I would understand a difference of a few seconds but not such a large discrepancy. Let me know your thoughts on why this is or suggestions on how to improve the python versions.

BTW: I'm using Python 3.6.5 with numpy 1.14.5 and MATLAB R2018a.

like image 398
Daniel Wendelken Avatar asked Feb 13 '26 07:02

Daniel Wendelken


1 Answers

As said in the comments, Matlab uses a jit-compiler by default Python doesn't. In Python you could use Numba to do quite the same.

Your code with slight modifications

import numba as nb
import numpy as np
import time

@nb.jit(fastmath=True,error_model='numpy')
def sample_entropy(time_series, sample_length, tolerance=None):
    """Calculate and return Sample Entropy of the given time series.
    Distance between two vectors defined as Euclidean distance and can
    be changed in future releases
    Args:
        time_series: Vector or string of the sample data
        sample_length: Number of sequential points of the time series
        tolerance: Tolerance (default = 0.1...0.2 * std(time_series))
    Returns:
        Vector containing Sample Entropy (float)
    References:
        [1] http://en.wikipedia.org/wiki/Sample_Entropy
        [2] http://physionet.incor.usp.br/physiotools/sampen/
        [3] Madalena Costa, Ary Goldberger, CK Peng. Multiscale entropy analysis
            of biological signals
    """
    if tolerance is None:
        tolerance = 0.1 * np.std(time_series)

    n = len(time_series)
    prev = np.zeros(n)
    curr = np.zeros(n)
    A = np.zeros((sample_length))  # number of matches for m = [1,...,template_length - 1]
    B = np.zeros((sample_length))  # number of matches for m = [1,...,template_length]

    for i in range(n - 1):
        nj = n - i - 1
        ts1 = time_series[i]
        for jj in range(nj):
            j = jj + i + 1
            if abs(time_series[j] - ts1) < tolerance:  # distance between two vectors
                curr[jj] = prev[jj] + 1
                temp_ts_length = min(sample_length, curr[jj])
                for m in range(int(temp_ts_length)):
                    A[m] += 1
                    if j < n - 1:
                        B[m] += 1
            else:
                curr[jj] = 0
        for j in range(nj):
            prev[j] = curr[j]

    N = n * (n - 1) // 2

    B2=np.empty(sample_length)
    B2[0]=N
    B2[1:]=B[:sample_length - 1]
    similarity_ratio = A / B2
    se = - np.log(similarity_ratio)
    return se

Timings

a = np.random.rand(1, 95000)[0] #Python
a = rand(1, 95000) #Matlab
Python 3.6, Numba 0.40dev, Matlab 2016b, Core i5-3210M

Python:       487s
Python+Numba: 12.2s
Matlab:       71.1s
like image 177
max9111 Avatar answered Feb 15 '26 21:02

max9111



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!