I have a user-own metric to implement as follows:
def metric(pred:pd.DataFrame(), valid:pd.DataFrame()):
    date_begin = valid.dt.min()
    date_end = valid.dt.max()
    x = valid[valid.label == 1].dt.min()
    # p
    p_n_tpp_df = valid[(valid.dt >= x) &\
                       (valid.dt <= x + timedelta(days=30)) &\
                       (p_n_tpp_df.label == 1)]
    p_n_pp_df =  valid[(valid.dt >= date_begin + timedelta(days=30)) &\ 
                       (valid.dt <= date_end + timedelta(days=30)) &\
                       (p_n_tpp_df.label == 1)]
    p_n_tpp = len([x for x in pred.serial_number.values\ 
                     if x in p_n_tpp_df.serial_number.unique()])
    p_n_pp = len([x for x in pred.serial_number.values\ 
                    if x in p_n_pp_df.serial_number.unique()])
    p = p_n_tpp / p_n_pp
    print('p: ', p)
    # r
    p_n_tpr_df = valid[(valid.dt >= date_begin - timedelta(days=30)) &\ 
                      (valid.dt <= date_end - timedelta(days=30)) &\
                      (p_n_tpr_df.label == 1)]
    p_n_pr_df = valid[(valid.dt >= date_begin) &\ 
                      (valid.dt <= date_end) &\ 
                      (p_n_pr_df.label == 1)]
    p_n_tpr = len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])
    p_n_pr = len([x for x in pred.serial_number.values\
                    if x in p_n_pr_df.serial_number.unique()])
    r = p_n_tpr / p_n_pr
    print('p: ', r)
    m = 2 * p * r / (p + r)
    return m
The pd.DataFrame() of pred and valid have the same columns and dt has no intersections. 
And the all the values of serial_number in valid is a subset of all the values of serial_number in pred.
The label column only has 2 values: 0 or 1.
Here is the sample of pred and valid is as follows:
print(pred.head(3))
    serial_number  dt          label  
0   123            2011-03-21  1
1   52             2011-03-22  0
2   12             2011-03-01  1
..., ...
print(pred.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetimes64[ns]
label          int8
..., ...
print(valid.head(3))
    serial_number  dt          label  
0   324            2011-04-22  1
1   52             2011-04-22  0
2   14             2011-04-01  1
..., ...
print(valid.info())
Int64Index: 10000000 entries,
Data columns (total 3 columns):
serial_number  int32
dt             datetimes64[ns]
label          int8
And the size of input pd.DataFrame is about 10, 000, 000 samples and 3 features.
When I try to use it to calculate this metric, it is really slow and time spending is more than 2 hours on Intel 9600KF.
So I am wondering how to optimize such code on time cost.
Thanks in advance.
Optimize Program Algorithm For any code, you should always allocate some time to think the right algorithm to use. So, the first task is to select and improve the algorithm which will be frequently used in the code. 2. Avoid Type Conversion Whenever possible, plan to use the same type of variables for processing.
Here is the biggest performance win in the code that you have:
len([x for x in pred.serial_number.values\
                     if x in p_n_tpr_df.serial_number.unique()])
Any line that looks like this is getting the size of the set intersection of pred.serial_number and p_n_tpr_df.serial_number. Using numpy rather than the list comprehension and the unique call will save substantial compute time:
intersect_size = np.intersect1d(pred.serial_number.values,
                                p_n_tpr_df.serial_number.values).shape[0]
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With