How to optimize the following code in Python?

Tags: python, pandas, numpy

I have a custom metric implemented as follows:

from datetime import timedelta

import pandas as pd


def metric(pred: pd.DataFrame, valid: pd.DataFrame):
    date_begin = valid.dt.min()
    date_end = valid.dt.max()
    # earliest date on which a positive label occurs
    x = valid[valid.label == 1].dt.min()

    # p
    p_n_tpp_df = valid[(valid.dt >= x) &
                       (valid.dt <= x + timedelta(days=30)) &
                       (valid.label == 1)]
    p_n_pp_df = valid[(valid.dt >= date_begin + timedelta(days=30)) &
                      (valid.dt <= date_end + timedelta(days=30)) &
                      (valid.label == 1)]

    # count the rows of pred whose serial number occurs in each subset
    p_n_tpp = len([x for x in pred.serial_number.values
                   if x in p_n_tpp_df.serial_number.unique()])
    p_n_pp = len([x for x in pred.serial_number.values
                  if x in p_n_pp_df.serial_number.unique()])

    p = p_n_tpp / p_n_pp
    print('p: ', p)

    # r
    p_n_tpr_df = valid[(valid.dt >= date_begin - timedelta(days=30)) &
                       (valid.dt <= date_end - timedelta(days=30)) &
                       (valid.label == 1)]
    p_n_pr_df = valid[(valid.dt >= date_begin) &
                      (valid.dt <= date_end) &
                      (valid.label == 1)]

    p_n_tpr = len([x for x in pred.serial_number.values
                   if x in p_n_tpr_df.serial_number.unique()])
    p_n_pr = len([x for x in pred.serial_number.values
                  if x in p_n_pr_df.serial_number.unique()])

    r = p_n_tpr / p_n_pr
    print('r: ', r)

    # harmonic mean of p and r
    m = 2 * p * r / (p + r)

    return m

The pred and valid DataFrames have the same columns, and their dt values do not overlap.
The serial_number values in valid are a subset of the serial_number values in pred.
The label column takes only two values: 0 or 1.
Here are samples of pred and valid:


print(pred.head(3))
    serial_number  dt          label  
0   123            2011-03-21  1
1   52             2011-03-22  0
2   12             2011-03-01  1
..., ...


print(pred.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetime64[ns]
label          int8
..., ...

print(valid.head(3))
    serial_number  dt          label  
0   324            2011-04-22  1
1   52             2011-04-22  0
2   14             2011-04-01  1
..., ...


print(valid.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetime64[ns]
label          int8

Each input DataFrame has about 10,000,000 rows and 3 columns.
Computing this metric is really slow: it takes more than 2 hours on an Intel 9600KF.
How can I reduce the running time of this code?
Thanks in advance.


asked Mar 05 '20 20:03

Bowen Peng

1 Answer

Here is the biggest performance win in the code that you have:

Numpy set logic

len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])

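Any line that looks like this counts the rows of pred whose serial_number also appears in p_n_tpr_df.serial_number. It is slow because unique() is re-evaluated, and the resulting array scanned linearly, for every one of the 10,000,000 rows. Provided the serial numbers in pred are unique, that count is the size of the set intersection of the two columns, and computing it with numpy rather than the list comprehension and the unique call will save substantial compute time: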

intersect_size = np.intersect1d(pred.serial_number.values,
                                p_n_tpr_df.serial_number.values).shape[0]
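
One caveat: np.intersect1d returns the distinct values shared by both arrays, while the original list comprehension counts every matching row of pred. If pred can contain repeated serial numbers, the two results differ, and np.isin may be the closer drop-in. The following is a minimal sketch under that assumption; the helper name count_matching_rows and the tiny sample frames are made up for illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd


def count_matching_rows(pred_serials, valid_serials):
    """Count the entries of pred_serials whose value occurs in valid_serials.

    Mirrors len([x for x in pred_serials if x in valid_serials]):
    a serial number repeated in pred_serials is counted each time.
    """
    return int(np.isin(pred_serials, valid_serials).sum())


# Made-up sample data; serial number 52 appears twice in pred.
pred = pd.DataFrame({"serial_number": [123, 52, 12, 52]})
subset = pd.DataFrame({"serial_number": [52, 12, 12, 324]})

# Vectorized row count (duplicates in pred are kept).
row_count = count_matching_rows(pred.serial_number.values,
                                subset.serial_number.values)

# Number of distinct serial numbers present in both columns.
unique_count = np.intersect1d(pred.serial_number.values,
                              subset.serial_number.values).shape[0]

# The original slow expression, for comparison on the small sample.
slow_count = len([x for x in pred.serial_number.values
                  if x in subset.serial_number.unique()])

print(row_count, unique_count, slow_count)
```

On this sample the np.isin count agrees with the original list comprehension, while np.intersect1d reports one fewer because the repeated serial number is counted once. Which of the two the metric should use depends on whether it is meant to count rows or distinct serial numbers.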

answered Oct 06 '22 18:10

hume
