 

Speeding up pairing of strings into objects in Python

When creating large numbers of objects, often the single biggest performance enhancement you can make is to turn the garbage collector off. Every "generation" of objects, the garbage collector traverses all the live objects in memory, looking for objects that are part of reference cycles but are no longer referenced by live objects, and are thus eligible for memory reclamation. See Doug Hellmann's PyMOTW gc article for some information (more can perhaps be found with Google and some determination). The garbage collector runs by default every 700 or so objects created-but-not-reclaimed, with subsequent generations running somewhat less often (I forget the exact details).
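As a quick illustration (the (700, 10, 10) values are the CPython defaults), the thresholds can be inspected and collection toggled like this:

import gc

# The three numbers are the collection thresholds for generations 0, 1 and 2;
# a generation-0 pass runs once allocations minus deallocations exceed 700.
print gc.get_threshold()    # (700, 10, 10) on a default CPython build

gc.disable()                # suspend automatic collection
# ... create a large number of objects with no reference cycles ...
gc.enable()                 # restore automatic collection
gc.collect()                # optionally force one full collection afterwards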

Using a standard tuple instead of a Point class can save you some time (using a namedtuple is somewhere in between), and clever unpacking can save you some time, but the largest gain comes from turning the gc off before creating lots of objects that you know don't need to be gc'd, and then turning it back on afterwards.
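For reference, a minimal sketch of the three representations mentioned above (the two-field Point class shown here is an assumption about what the original looks like):

from collections import namedtuple

# Plain tuple: cheapest to construct, fields accessed by index.
p_tuple = (3, 7)

# namedtuple: still a tuple underneath, but with named attribute access.
PointNT = namedtuple("PointNT", ["x", "y"])
p_named = PointNT(3, 7)

# A minimal two-field class, assumed to be roughly what the original Point is.
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

p_class = Point(3, 7)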

Some code:

def orig_test_gc_off():
    import time
    data = get_data()
    all_point_sets = []
    import gc
    gc.disable()
    time_start = time.time()
    for row in data:
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "gc off total time: ", (time_end - time_start)

def test1():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    gc.disable()
    for index, row in enumerate(data):
        first_points, second_points = row
        curr_points = map(
            Point,
            [int(i) for i in first_points.split(",")],
            [int(i) for i in second_points.split(",")])
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 1 total time: ", (time_end - time_start)

def test2():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    gc.disable()
    time_start = time.time()
    for index, row in enumerate(data):
        first_points, second_points = row
        first_points = [int(i) for i in first_points.split(",")]
        second_points = [int(i) for i in second_points.split(",")]
        curr_points = [(x, y) for x, y in zip(first_points, second_points)]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 2 total time: ", (time_end - time_start)

orig_test()
orig_test_gc_off()
test1()
test2()

Some results:

>>> %run /tmp/flup.py
total time:  6.90738511086
gc off total time:  4.94075202942
variant 1 total time:  4.41632509232
variant 2 total time:  3.23905301094

Simply running with pypy makes a big difference

$ python pairing_strings.py 
total time:  2.09194397926
$ pypy pairing_strings.py 
total time:  0.764246940613

Disabling gc didn't help for pypy

$ pypy pairing_strings.py 
total time:  0.763386964798

Using a namedtuple for Point makes it worse

$ pypy pairing_strings.py 
total time:  0.888827085495

Using itertools.imap and itertools.izip

$ pypy pairing_strings.py 
total time:  0.615751981735
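That variant isn't shown above; a rough sketch of what the imap/izip version might look like (Python 2, assuming the same Point class and get_data as in the earlier tests):

from itertools import imap, izip

def test_imap_izip():
    import time
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        # imap/izip produce lazy iterators instead of intermediate lists
        pairs = izip(imap(int, xs.split(",")), imap(int, ys.split(",")))
        all_point_sets.append([Point(x, y) for x, y in pairs])
    time_end = time.time()
    print "imap/izip total time: ", (time_end - time_start)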

Using a memoized version of int and an iterator to avoid the zip

$ pypy pairing_strings.py 
total time:  0.423738002777 

Here is the code I finished with.

def test():
    import time
    def m_int(s, memo={}):
        if s in memo:
            return memo[s]
        else:
            retval = memo[s] = int(s)
            return retval
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        # Convert points from strings to integers
        y_iter = iter(ys.split(","))
        curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)

I would

  • use numpy arrays for this problem (Cython would be an option if this is still not fast enough).
  • store the points as a vector, not as single Point instances.
  • rely on existing parsers
  • (if possible) parse the data once and then store it in a binary format like HDF5 for further calculations, which will be the fastest option (see below)

Numpy has built-in functions to read text files, for instance loadtxt. If you have the data stored in a structured array, you do not necessarily need to convert it to another data type. I will use Pandas, which is a library built on top of numpy. It is a bit more convenient for handling and processing structured data. Pandas has its own file parser, read_csv.
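For completeness, a minimal loadtxt sketch for a comma-separated file laid out like the example.csv created below (the 10/10 column split is an assumption based on that layout):

import numpy as np

# One (n_rows, 20) integer array: columns 0-9 hold the x values,
# columns 10-19 the y values.
arr = np.loadtxt('example.csv', delimiter=',', dtype=int)
xs, ys = arr[:, :10], arr[:, 10:]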

To time it, I wrote the data to a file, as in your original problem (the code is based on your get_data):

import numpy as np
import pandas as pd

def create_example_file(n=100000, m=20):
    ex1 = pd.DataFrame(np.random.randint(1, 10, size=(n, m)),
                       columns=(['x_%d' % x for x in range(10)] +
                                ['y_%d' % y for y in range(10)]))
    ex1.to_csv('example.csv', index=False, header=False)
    return

This is the code I used to read the data in a pandas.DataFrame:

def with_read_csv(csv_file):
    df = pd.read_csv(csv_file, header=None,
                     names=(['x_%d' % x for x in range(10)] +
                            ['y_%d' % y for y in range(10)]))
    return df

(Note that I assumed there is no header in your file, so I had to create the column names.)

Reading the data is fast and should be more memory efficient (see this question), and the data ends up in a data structure you can further work with in a fast, vectorized way:

In [18]: %timeit string_to_object.with_read_csv('example.csv')
1 loops, best of 3: 553 ms per loop

There is a new C-based parser in a development branch which takes 414 ms on my system. Your test takes 2.29 s on my system, but it is not really comparable, as the data is not read from a file and you created the Point instances.

Once you have read in the data, you can store it in an HDF5 file:

In [19]: store = pd.HDFStore('example.h5')

In [20]: store['data'] = df

In [21]: store.close()

Next time you need the data, you can read it from this file, which is really fast:

In [1]: store = pd.HDFStore('example.h5')

In [2]: %timeit df = store['data']
100 loops, best of 3: 16.5 ms per loop

However, this is only applicable if you need the same data more than once.

Using numpy-based arrays with large data sets will have advantages when you are doing further calculations. Cython wouldn't necessarily be faster if you can use vectorized numpy functions and indexing; it will be faster if you really need iteration (see also this answer).
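As a small illustration of the vectorized style (using the column names created above), pairing up the x and y columns touches whole arrays at once rather than looping in Python:

import numpy as np

# Pull the ten x columns and ten y columns out of the DataFrame as arrays
x_cols = df[['x_%d' % i for i in range(10)]].values
y_cols = df[['y_%d' % i for i in range(10)]].values

# Stack them into one (n_rows, 10, 2) array of (x, y) pairs -- no Python loop
points = np.dstack((x_cols, y_cols))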


Faster method, using Numpy (speedup of about 7x):

import numpy as np
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
points = arr.reshape(100000, 2, 10).transpose((0,2,1))

Performance comparison:

def load_1(data):
    import gc
    all_point_sets = []
    gc.disable()
    for xs, ys in data:
        all_point_sets.append(zip(map(int, xs.split(',')), map(int, ys.split(','))))
    gc.enable()
    return all_point_sets

def load_2(data):
    txt = ','.join(','.join(row) for row in data)
    arr = np.fromstring(txt, dtype=int, sep=',')
    return arr.reshape(100000, 2, 10).transpose((0,2,1))

load_1 runs in 1.52 seconds on my machine; load_2 runs in 0.20 seconds, a 7-fold improvement. The big caveat here is that it requires (1) that you know the lengths of everything in advance, and (2) that every row contains exactly the same number of points. This is true for your get_data output, but may not be true for your real dataset.
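If you would rather not hardcode those lengths, they can be derived from the parsed array itself, still assuming every row has the same number of points (txt and data as in load_2):

n_rows = len(data)
arr = np.fromstring(txt, dtype=int, sep=',')
points_per_row = arr.size // (2 * n_rows)   # 10 for the example data
points = arr.reshape(n_rows, 2, points_per_row).transpose((0, 2, 1))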


I got a 50% improvement by using arrays and a holder object that lazily constructs Point objects when accessed. I also "slotted" the Point object for better storage efficiency. However, a tuple would probably be better.

Changing the data structure may also help, if that's possible. But this will never be instantaneous.

from array import array

class Point(object):
    __slots__ = ["a", "b"]
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __repr__(self):
        return "Point(%d, %d)" % (self.a, self.b)

class Points(object):
    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def __getitem__(self, i):
        return Point(self.xs[i], self.ys[i])

def test3():
    import time
    data = get_data()
    xs = array("i")
    ys = array("i")
    time_start = time.time()
    for row in data:
        xs.extend([int(val) for val in row[0].split(",")])
        ys.extend([int(val) for val in row[1].split(",")])
    print "total time: ", (time.time() - time_start)
    return Points(xs, ys)

But when dealing with large amounts of data I would usually use numpy N-dimensional arrays (ndarray). If the original data structure could be altered, that would likely be fastest of all: if it could be structured so that the x,y pairs are read in linearly, you could then simply reshape the ndarray, as in the rough sketch below.
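A rough sketch of that last idea, assuming the raw text could be produced as one flat x,y,x,y,... stream (raw_text here is a placeholder for however you obtain it):

import numpy as np

# raw_text: "x0,y0,x1,y1,..." for every point, read in linearly (assumed)
flat = np.fromstring(raw_text, dtype=int, sep=',')
pairs = flat.reshape(-1, 2)    # one (x, y) pair per row of the ndarray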