My Python program was too slow. So I profiled it and found that most of the time was being spent in a function that computes the distance between two points (a point is a list of 3 Python floats):
def get_dist(pt0, pt1):
    val = 0
    for i in range(3):
        val += (pt0[i] - pt1[i]) ** 2
    val = math.sqrt(val)
    return val
To analyze why this function was so slow, I wrote two test programs, one in Python and one in C++, that do a similar computation: they compute the distance between 1 million pairs of points. (The test code in both languages is below.)
The Python computation takes 2 seconds, while C++ takes 0.02 seconds. A 100x difference!
Why is Python code so much slower than C++ code for such simple math computations? How do I speed it up to match the C++ performance?
The Python code used for testing:
import math, random, time

num = 1000000

# Generate random points and numbers
pt_list = []
rand_list = []
for i in range(num):
    pt = []
    for j in range(3):
        pt.append(random.random())
    pt_list.append(pt)
    rand_list.append(random.randint(0, num - 1))

# Compute
beg_time = time.clock()
dist = 0
for i in range(num):
    pt0 = pt_list[i]
    ri = rand_list[i]
    pt1 = pt_list[ri]
    val = 0
    for j in range(3):
        val += (pt0[j] - pt1[j]) ** 2
    val = math.sqrt(val)
    dist += val
end_time = time.clock()
elap_time = (end_time - beg_time)
print elap_time
print dist
The C++ code used for testing:
#include <cstdlib>
#include <iostream>
#include <ctime>
#include <cmath>

struct Point
{
    double v[3];
};

int num = 1000000;

int main()
{
    // Allocate memory
    Point** pt_list = new Point*[num];
    int* rand_list = new int[num];

    // Generate random points and numbers
    for ( int i = 0; i < num; ++i )
    {
        Point* pt = new Point;
        for ( int j = 0; j < 3; ++j )
        {
            const double r = (double) rand() / (double) RAND_MAX;
            pt->v[j] = r;
        }
        pt_list[i] = pt;
        rand_list[i] = rand() % num;
    }

    // Compute
    clock_t beg_time = clock();
    double dist = 0;
    for ( int i = 0; i < num; ++i )
    {
        const Point* pt0 = pt_list[i];
        int r = rand_list[i];
        const Point* pt1 = pt_list[r];
        double val = 0;
        for ( int j = 0; j < 3; ++j )
        {
            const double d = pt0->v[j] - pt1->v[j];
            val += ( d * d );
        }
        val = sqrt(val);
        dist += val;
    }
    clock_t end_time = clock();
    double sec_time = (end_time - beg_time) / (double) CLOCKS_PER_SEC;
    std::cout << sec_time << std::endl;
    std::cout << dist << std::endl;
    return 0;
}
The code is interpreted, which introduces overhead on every operation. On top of that, Python can't assume the objects are numbers; it must inspect them at run time just to know which operation to perform. For example, + on two numbers adds them, while + on two strings concatenates them.
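A minimal illustration of that run-time dispatch:

```python
# The same + operator means different things depending on operand types,
# and CPython has to discover which meaning applies on every operation.
print(2 + 3)        # 5
print("2" + "3")    # 23
print([2] + [3])    # [2, 3]
```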
(Aside: since Python 3.8 the standard library provides math.dist(p, q), which returns the Euclidean distance between two points directly.)
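As a sketch, for a single pair of points (math.dist requires Python 3.8 or later):

```python
import math

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

# math.dist computes sqrt of the sum of squared coordinate differences
print(math.dist(p, q))  # 5.0
```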
A sequence of optimizations:
import math, random, time

num = 1000000

# Generate random points and numbers
# Change #1: Sometimes it's good not to have too much randomness.
# This is one of those cases.
# Changing the code shouldn't change the results.
# Using a fixed seed ensures that the changes are valid.
# The final 'print dist' should yield the same result regardless of optimizations.
# Note: There's nothing magical about this seed.
# I randomly picked a commit hash from a git log.
random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

pt_list = []
rand_list = []
for i in range(num):
    pt = []
    for j in range(3):
        pt.append(random.random())
    pt_list.append(pt)

# Change #2: rand_list is computed in a separate loop.
# This ensures that upcoming optimizations will get the same results as
# this unoptimized version.
for i in range(num):
    rand_list.append(random.randint(0, num - 1))

# Compute
beg_time = time.clock()
dist = 0
for i in range(num):
    pt0 = pt_list[i]
    ri = rand_list[i]
    pt1 = pt_list[ri]
    val = 0
    for j in range(3):
        val += (pt0[j] - pt1[j]) ** 2
    val = math.sqrt(val)
    dist += val
end_time = time.clock()
elap_time = (end_time - beg_time)
print elap_time
print dist
The first optimization (not shown) is to embed all of the code except the import
in a function. This simple change offers a 36% performance boost on my computer.
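Why wrapping the code in a function helps can be seen with the dis module; this is a sketch (the function f here is just an illustration, not the benchmark code):

```python
import dis

def f():
    total = 0
    for i in range(10):
        total += i
    return total

# Locals inside a function compile to LOAD_FAST/STORE_FAST opcodes, which
# index a fixed-size array of local slots; the same statements at module
# scope would compile to LOAD_NAME/STORE_NAME dictionary lookups, which
# are slower on every single access.
ops = {ins.opname for ins in dis.get_instructions(f)}
print(any(op.startswith("LOAD_FAST") for op in ops))  # True
```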
The next optimization is to avoid the ** operator. You don't use pow(d,2) in your C code because everyone knows that this is suboptimal in C. It's also suboptimal in Python. Python's ** is smart; it evaluates x**2 as x*x. However, it takes time to be smart. You know you want d*d, so use it. Here's the computation loop with that optimization:
for i in range(num):
    pt0 = pt_list[i]
    ri = rand_list[i]
    pt1 = pt_list[ri]
    val = 0
    for j in range(3):
        d = pt0[j] - pt1[j]
        val += d*d
    val = math.sqrt(val)
    dist += val
Your Python code looks a whole lot like your C code. You aren't taking advantage of the language.
import math, random, time, itertools

def main (num=1000000) :
    # This small optimization speeds things up by a couple percent.
    sqrt = math.sqrt

    # Generate random points and numbers
    random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    def random_point () :
        return [random.random(), random.random(), random.random()]

    def random_index () :
        return random.randint(0, num-1)

    # Big optimization:
    # Build the data with list comprehensions instead of explicit append
    # loops, and look up each random partner point once, up front.
    # (pt_list itself must stay a real list because it is indexed below;
    # the win is eliminating the Python-level append loops.)
    pt_list = [random_point() for i in xrange(num)]
    rand_pts = [pt_list[random_index()] for i in xrange(num)]

    # Compute
    beg_time = time.clock()
    dist = 0
    # Don't loop over a range of indices. That's too C-like.
    # Instead, iterate over the data directly with a lazy iterator:
    # itertools.izip pairs the two lists without building a new collection.
    # This is particularly important when the collections are large.
    for (pt0, pt1) in itertools.izip (pt_list, rand_pts) :
        # Small optimization: inner loop inlined,
        # intermediate variable 'val' eliminated.
        d0 = pt0[0]-pt1[0]
        d1 = pt0[1]-pt1[1]
        d2 = pt0[2]-pt1[2]
        dist += sqrt(d0*d0 + d1*d1 + d2*d2)
    end_time = time.clock()
    elap_time = (end_time - beg_time)
    print elap_time
    print dist
The following takes about 1/40th the time of the original version in the timed section of the code. Not quite as fast as C, but close.
Note the commented-out "Mondo slow" computation. That takes about ten times as long as the original version. There is an overhead cost to using numpy: the setup in the code that follows takes quite a bit longer than in my non-numpy optimization above.
Bottom line: take care when using numpy; the setup costs can be significant.
import numpy, random, time

def main (num=1000000) :
    # Generate random points and numbers
    random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    def random_point () :
        return [random.random(), random.random(), random.random()]

    def random_index () :
        return random.randint(0, num-1)

    pt_list = numpy.array([random_point() for i in xrange(num)])
    rand_pts = pt_list[[random_index() for i in xrange(num)],:]

    # Compute
    beg_time = time.clock()
    # Mondo slow.
    # dist = numpy.sum (
    #     numpy.apply_along_axis (
    #         numpy.linalg.norm, 1, pt_list - rand_pts))
    # Mondo fast.
    dist = numpy.sum ((numpy.sum ((pt_list-rand_pts)**2, axis=1))**0.5)
    end_time = time.clock()
    elap_time = (end_time - beg_time)
    print elap_time
    print dist
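As a cross-check, the same vectorized computation can also be written with numpy.linalg.norm and its axis argument, which avoids the slow apply_along_axis path entirely. This sketch uses a smaller array and numpy's default_rng for setup; that setup is my assumption, not the original benchmark's:

```python
import numpy as np

rng = np.random.default_rng(0)
num = 10000
pt_list = rng.random((num, 3))
rand_pts = pt_list[rng.integers(0, num, size=num)]

# Per-row Euclidean norms in one vectorized call, then summed.
dist = np.linalg.norm(pt_list - rand_pts, axis=1).sum()

# Same result as the explicit sum-of-squares formulation.
check = np.sum(np.sum((pt_list - rand_pts) ** 2, axis=1) ** 0.5)
print(np.isclose(dist, check))  # True
```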
Some general hints:
Move all your code into a main() function and use the normal

if __name__ == "__main__":
    main()

construct. It greatly improves speed due to variable scoping. See "Why does Python code run faster in a function?" for an explanation of why.
Don't use range(), since it builds the complete list of numbers at once, which is slow for large counts; use xrange() instead, which yields values lazily. (In Python 3, range() is already lazy and xrange() no longer exists.)
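The listings above are written for Python 2 (print statements, xrange, itertools.izip, time.clock). A Python 3 port of the optimized loop might look like the following sketch; zip and range are lazy in Python 3, time.perf_counter replaces the removed time.clock, and the seed value 0 and the smaller default num are arbitrary choices of mine:

```python
import math, random, time

def main(num=100000):
    random.seed(0)  # any fixed seed keeps results reproducible
    pt_list = [[random.random() for _ in range(3)] for _ in range(num)]
    rand_pts = [pt_list[random.randrange(num)] for _ in range(num)]
    sqrt = math.sqrt  # bind the attribute lookup once

    beg_time = time.perf_counter()
    dist = 0.0
    for pt0, pt1 in zip(pt_list, rand_pts):  # zip is lazy in Python 3
        d0 = pt0[0] - pt1[0]
        d1 = pt0[1] - pt1[1]
        d2 = pt0[2] - pt1[2]
        dist += sqrt(d0 * d0 + d1 * d1 + d2 * d2)
    print(time.perf_counter() - beg_time)
    print(dist)
    return dist

if __name__ == "__main__":
    main()
```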