Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why Python is so slow for a simple for loop?

People also ask

Why does my loop slow down Python?

Memory - if you're storing all of the generated values somewhere, the loop might be slowing down because the of all the memory being used. Python has it's own memory manager, so using up lots of memory could eventually make the program slower.

How do you make a for loop faster in Python?

A faster way to loop using built-in functions A faster way to loop in Python is using built-in functions. In our example, we could replace the for loop with the sum function. This function will sum the values inside the range of numbers. The code above takes 0.84 seconds.

Is Python slow at loops?

Looping over Python arrays, lists, or dictionaries, can be slow. Thus, vectorized operations in Numpy are mapped to highly optimized C code, making them much faster than their standard Python counterparts.

Why is my for loop taking so long?

Growing variables in a loop takes very long. Each time you increase the length of the variable, a million times here, you force MATLAB to first create a variable with the initial length+1, then copy the contents, then delete the old variable. That's probably what is taking your code so long.


Why is Java faster than Python on this example loops?

Novice Explanation: Think of a program like a freight train that lays its own train-track as it moves forward. Track must be laid before the train can move. The Java Freight train can send thousands of track-layers ahead of the train, all working in parallel laying track many miles in advance, wheras python can only send one laboror at a time, and can only lay track 10 feet in front of where the train is.

Java has strong types and that affords the compiler to use JIT features: (https://en.wikipedia.org/wiki/Just-in-time_compilation) which enable the CPU to fetch memory and execute instructions in the future in parallel, before the instruction is needed. Java can 'sort of' run the instructions in your for loop in parallel with itself. Python has no concrete types and so the nature of the work to be done has to be decided at every instruction. This causes your entire computer to stop and wait for all the memory in all of your variables to be re-scanned. Meaning loops in python are polynomial O(n^2) time, wheras Java loops can be, and often are linear time O(n), due to strong types.

I think I am making mistakes because I know Python is used by lots of scientific projects.

They're heavily using SciPy (NumPy being the most prominent component, but I've heard the ecosystem that developed around NumPy's API is even more important) which vastly speeds up all kinds operations these projects need. There's what you are doing wrong: You aren't writing your critical code in C. Python is great for developing in general, but well-placed extension modules are a vital optimization in its own right (at least when you're crunching numbers). Python is a really crappy language to implement tight inner loops in.

The default (and for the time being most popular and widely-supported) implementation is a simple bytecode interpreter. Even the simplest operations, like an integer division, can take hundreds of CPU cycles, multiple memory accesses (type checks being a popular example), several C function calls, etc. instead of a few (or even single, in the case of integer division) instruction. Moreover, the language is designed with many abstractions which add overhead. Your loop allocates 9999 objects on the heap if you use xrange - far more if you use range (99999999 integer minus around 256256 for small integers which are cached). Also, the xrange version calls a method on each iteration to advance - the range version would too if iteration over sequences hadn't been optimized specifically. It still takes a whole bytecode dispatch though, which is itself vastly complex (compared to an integer division, of course).

It would be interesting to see what a JIT (I'd recommend PyPy over Psyco, the latter isn't actively developed anymore and very limited in scope anyway - it might work well for this simple example though). After a tiny fraction of iterations, it should produce a nigh-optimal machine code loop augmented with a few guards - simple integer comparisions, jumping if they fail - to maintain correctness in case you got a string in that list. Java can do the same thing, only sooner (it doesn't have to trace first) and with fewer guards (at least if you use ints). That's why it's so much faster.


Because you mention scientific code, have a look at numpy. What you're doing has probably already been done (or rather, it uses LAPACK for things like SVD). When you hear about python being used for scientific code, people probably aren't referring to using it in the way you do in your example.

As a quick example:

(If you're using python3, your example would use float division. My example assumes you're using python2.x, and therefore integer division. If not, specify i = np.arange(9999, dtype=np.float), etc)

import numpy as np
i = np.arange(9999)
j = np.arange(1, 9999)
print np.divide.outer(i,j).sum()

To give some idea of timing... (I'll use floating point division here, instead of integer division as in your example):

import numpy as np

def f1(num):
    total = 0.0
    for i in range(num): 
        for j in range(1, num):
            total += (float(i) / j)
    return total

def f2(num):
    i = np.arange(num, dtype=np.float)
    j = np.arange(1, num, dtype=np.float)
    return np.divide.outer(i, j).sum()

def f3(num):
    """Less memory-hungry (and faster) version of f2."""
    total = 0.0
    j = np.arange(1, num, dtype=np.float)
    for i in xrange(num):
        total += (i / j).sum()
    return total

If we compare timings:

In [30]: %timeit f1(9999)
1 loops, best of 3: 27.2 s per loop

In [31]: %timeit f2(9999)
1 loops, best of 3: 1.46 s per loop

In [32]: %timeit f3(9999)
1 loops, best of 3: 915 ms per loop

I think NumPy can be faster than CPython for loops (I didn't test in PyPy).

I want to start from Joe Kington's code because this answer used NumPy.

%timeit f3(9999)
704 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

by myself:

def f4(num):
    x=np.ones(num-1)
    y=np.arange(1,num)
    return np.sum(np.true_divide(x,y))*np.sum(y)

155 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In addition, High School Mathematics can simplify the problem to computer.

Problem= (1+2+...+(num-1)) * (1/1+1/2+...+1/(num-1))
1+2+...+(num-1)=np.sum(np.arange(1,num))=num*(num-1)/2
1/1+1/2+...+1/(num-1)=np.true_divide (1,y)=np.reciprocal(y.astype(np.float64))

Therefore,

def f5(num):
    return np.sum(np.reciprocal(np.arange(1, num).astype(np.float64))) * num*(num-1)/2
%timeit f5(9999)
106 µs ± 615 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In addition, University Mathematics can simplify the problem to computer more.

1/1+1/2+...+1/(num-1)=np.log(num-1)+1/(2*num-2)+np.euler_gamma
(n>2)

np.euler_gamma: Euler-Mascheroni constant (0.57721566...)

Because of inaccuracy of Euler-Mascheroni constant in NumPy, You lose accuracy like 489223499.9991845 -> 489223500.0408554. If You can ignore 0.0000000085% inaccuracy, You can save more time.

def f6(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
%timeit f6(9999)
4.82 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Benefit of NumPy becomes larger with larger input.

%timeit f3(99999)
56.7 s ± 590 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f5(99999)
534 µs ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f5(99999999)
1.42 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.498947911958**416**e+16
%timeit f6(99999999)
4.88 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.498947911958**506**e+16
%timeit f6(9999999999999999999)
17.9 µs ± 921 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In special case, You can use numba (Unfortunately not always).

from numba import jit
@jit
def f7(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
# same code with f6(num)

%timeit f6(999999999999999)
5.63 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
f7(123) # compile f7(num)
%timeit f7(999999999999999)
331 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit f7(9999)
286 ns ± 3.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So, I recommend to use NumPy, mathematics and numba together.


Python for loops are statically typed and interpreted. Not compiled. Java is faster because it has extra JIT acceleration features that Python does not have.

http://en.wikipedia.org/wiki/Just-in-time_compilation

To illustrate just how massive a difference Java JIT makes, look at this python program that takes about 5 minutes:

if __name__ =='__main__':
    total = 0.0
    i=1
    while i<=9999:
        j=1
        while j<=9999:
            total=1
            j+=1
        i+=1
    print total

While this fundamentally equivalent Java program takes about 23 milliseconds:

public class Main{
    public static void main(String args[]){
        float total = 0f; 

        long start_time = System.nanoTime();
        int i=1;

        while (i<=9999){
            int j=1;
            while(j<=9999){
                total+=1;
                j+=1;
            }
            i+=1;
        }
        long end_time = System.nanoTime();

        System.out.println("total: " + total);
        System.out.println("total milliseconds: " + 
           (end_time - start_time)/1000000);
    }
}

In terms of doing anything in a for loop, Java cleans python's clock by being between 1 and 1000 orders of magnitude faster.

Moral of the story: basic python for loops should be avoided at all costs if speedy performance is required. This could be because Guido van Rossum wants to encourage people to use multi-processor friendly constructs like array splicing, which operate faster than Java.


The benefit of Python is that there is a lot more flexibility (e.g. classes are objects) compared to Java (where you only have this reflection mechanism)

What's not mentioned here is Cython. It allows to introduce typed variables and trans-compile your example to C/C++. Then it's much faster. I've also changed the bounds in the loop ...

from __future__ import division

cdef double total = 0.00
cdef int i, j
for i in range(9999):
    for j in range(1, 10000+i):
        total += (i / j)

from time import time
t = time()
print("total = %d" % total)
print("time = %f[s]" % (time() - t))

Followed by

$ cython loops.pyx
$ gcc -I/usr/include/python2.7 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing -O3 -o loops.so loops.c
$ python -c "import loops"

gives

total = 514219068
time = 0.000047[s]

This is a known phenomenon -- python code is dynamic and interpreted, java code is statically typed and compiled. No surprises there.

The reasons people give for preferring python are often:

  • smaller code base
  • less redundancy (more DRY)
  • cleaner code

However, if you use a library written in C (from python), the performance may be much better (compare: pickle to cpickle).