My problem: I've found that processing large data sets with raw C++ using the STL map and vector can often be considerably faster (and with lower memory footprint) than using Cython.
I figure that part of this speed penalty is due to using Python lists and dicts, and that there might be some tricks to use less encumbered data structures in Cython. For example, this page (http://wiki.cython.org/tutorials/numpy) shows how to make numpy arrays very fast in Cython by predefining the size and types of the ND array.
Question: Is there any way to do something similar with lists/dicts, e.g. by stating roughly how many elements or (key,value) pairs you expect to have in them? That is, is there an idiomatic way to convert lists/dicts to (fast) data structures in Cython?
If not I guess I'll just have to write it in C++ and wrap in a Cython import.
Cython now has template support, and comes with declarations for some of the STL containers.
See http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#standard-library
Here's the example they give:
from libcpp.vector cimport vector
cdef vector[int] vect
cdef int i
for i in range(10):
vect.push_back(i)
for i in range(10):
print vect[i]
Doing similar operations in Python as in C++ can often be slower. list
and dict
are actually implemented very well, but you gain a lot of overhead using Python objects, which are more abstract than C++ objects and require a lot more lookup at runtime.
Incidentally, std::vector
is implemented in a pretty similar way to list
. std::map
, though, is actually implemented in a way that many operations are slower than dict
as its size gets large. For suitably large examples of each, dict
overcomes the constant factor by which it's slower than std::map
and will actually do operations like lookup, insertion, etc. faster.
If you want to use std::map
and std::vector
, nothing is stopping you. You'll have to wrap them yourself if you want to expose them to Python. Do not be shocked if this wrapping consumes all or much of the time you were hoping to save. I am not aware of any tools that make this automatic for you.
There are C API calls for controlling the creation of objects with some detail. You can say "Make a list with at least this many elements", but this doesn't improve the overall complexity of your list creation-and-filling operation. It certainly doesn't change much later as you try to change your list.
My general advice is
If you want a fixed-size array (you talk about specifying the size of a list), you may actually want something like a numpy array.
I doubt you are going to get any speedup you want out of using std::vector
over list
for a general replacement in your code. If you want to use it behind the scenes, it may give you a satisfying size and space improvement (I of course don't know without measuring, nor do you. ;) ).
dict
actually does its job really well. I definitely wouldn't try introducing a new general-purpose type for use in Python based on std::map
, which has worse algorithmic complexity in time for many important operations and—in at least some implementations—leaves some optimisations to the user that dict
already has.
If I did want something that worked a little more like std::map
, I'd probably use a database. This is generally what I do if stuff I want to store in a dict
(or for that matter, stuff I store in a list
) gets too big for me to feel comfortable storing in memory. Python has sqlite3
in the stdlib and drivers for all other major databases available.
C++ is fast not just because of the static declarations of the vector and the elements that go into it, but crucially because using templates/generics one specifies that the vector will only contain elements of a certain type, e.g. vector with tuples of three elements. Cython can't do this last thing and it sounds nontrivial -- it would have to be enforced at compile time, somehow (typechecking at runtime is what Python already does). So right now when you pop something off a list in Cython there is no way of knowing in advance what type it is , and putting it in a typed variable only adds a typecheck, not speed. This means that there is no way of bypassing the Python interpreter in this regard, and it seems to me it's the most crucial shortcoming of Cython for non-numerical tasks.
The manual way of solving this is to subclass the python list/dict (or perhaps std::vector) with a cdef class for a specific type of element or key-value combination. This would amount to the same thing as the code that templates are generating. As long as you use the resulting class in Cython code it should provide an improvement.
Using databases or arrays just solves a different problem, because this is about putting arbitrary objects (but with a specific type, and preferably a cdef class) in containers.
And std::map shouldn't be compared to dict; std::map maintains keys in sorted order because it is a balanced tree, dict solves a different problem. A better comparison would be dict and Google's hashtable.
You can take a look at the standard array
module for Python if this is appropriate for your Cython setting. I'm not sure since I have never used Cython.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With