I am very new to Cython, yet am already experiencing extraordinary speedups just copying my .py
to .pyx
(and cimport cython
, numpy
etc) and importing into ipython3
with pyximport
.
Many tutorials start in this approach with the next step being to add cdef
declarations for every data type, which I can do for the iterators in my for loops etc.
But unlike most Pandas Cython tutorials or examples I am not apply functions so to speak, more manipulating data using slices, sums and division (etc).
So the question is: Can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double
), with columns that are int
and rows that are int
?
How to define the type of an embedded list? i.e [[int,int],[int]]
Here is an example that generates the AIC score for a partitioning of a DF, sorry it is so verbose:
cimport cython
import numpy as np
cimport numpy as np
import pandas as pd
offcat = [
"breakingPeace",
"damage",
"deception",
"kill",
"miscellaneous",
"royalOffences",
"sexual",
"theft",
"violentTheft"
]
def partitionAIC(EmpFrame, part, OffenceEstimateFrame, ReturnDeathEstimate=False):
"""EmpFrame is DataFrame of ints, part is nested list of ints, OffenceEstimate frame is DF of float"""
"""partOf/block is a list of ints"""
"""ll, AIC, is series/frame of floats"""
##Cython cdefs
cdef int DFlen
cdef int puns
cdef int DeathPun
cdef int k
cdef int pId
cdef int punish
DFlen = EmpFrame.shape[1]
puns = 2
DeathPun = 0
PartitionModel = pd.DataFrame(index = EmpFrame.index, columns = EmpFrame.columns)
for partOf in part:
Grouping = [puns*x + y for x in partOf for y in list(range(0,puns))]
PartGroupSum = EmpFrame.iloc[:,Grouping].sum(axis=1)
for punish in range(0,puns):
PunishGroup = [x*puns+punish for x in partOf]
punishPunishment = ((EmpFrame.iloc[:,PunishGroup].sum(axis = 1) + 1/puns).div(PartGroupSum+1)).values[np.newaxis].T
PartitionModel.iloc[:,PunishGroup] = punishPunishment
PartitionModel = PartitionModel*OffenceEstimateFrame
if ReturnDeathEstimate:
DeathProbFrame = pd.DataFrame([[part]], index=EmpFrame.index, columns=['Partition'])
for pId,block in enumerate(part):
DeathProbFrame[pId] = PartitionModel.iloc[:,block[::puns]].sum(axis=1)
DeathProbFrame = DeathProbFrame.apply(lambda row: sorted( [ [format("%6.5f"%row[idx])]+[offcat[X] for X in x ]
for idx,x in enumerate(row['Partition'])],
key=lambda x: x[0], reverse=True),axis=1)
ll = (EmpFrame*np.log(PartitionModel.convert_objects(convert_numeric=True))).sum(axis=1)
k = (len(part))*(puns-1)
AIC = 2*k-2*ll
if ReturnDeathEstimate:
return AIC, DeathProbFrame
else:
return AIC
My advice is to do as much as possible in pandas. This is kinda standard advice "get it working first, then care about performance if it really matters". So let's suppose you've done that (hopefully you've written some tests too), and it's too slow:
Profile your code. (See this SO answer, or use %prun in ipython).
The output of prun should drive what bit to improve next.
Now, if it is a line to do with slicing (it probably isn't) put that tiny part in cython, I like to remove single python function calls to cython function. On that point stuff with cython should use numpy not pandas, I don't think pandas is not going to lower to C (cython can't infer types).
Putting your entire code into cython won't actually help that much, you want to only put the specific lines, or function calls, which are performance sensitive. Keeping cython focussed is the only way to have a good time.
Read the enhancing performance section of the pandas docs*! Here this process (prun -> cythonize -> type) is gone over step-by-step with a real-life example.
*Full-disclose I wrote it that section of the docs! :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With