Python Pandas: Why is numpy so much faster than Pandas for column assignment? Can I optimize further?

I am preprocessing data for a machine learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). This is applied to a single Pandas DataFrame column and outputs a new DataFrame with the same number of rows as the original and a width equal to the number of unique categories in that column.
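To illustrate the shape relationship, here is a minimal sketch with a made-up 3-row column of 2 categories:

```python
import pandas as pd

col = pd.Series(['cat', 'dog', 'cat'])
dummies = pd.get_dummies(col, prefix='animal')
# same number of rows (3), one column per unique category (2):
# columns are ['animal_cat', 'animal_dog']
```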

I need to complete this for a DataFrame of shape: (3,000,000 x 16) which outputs a binary matrix of shape: (3,000,000 x 600).

During the process, the step of converting to a binary matrix with pd.get_dummies() is very quick, but the assignment to the output matrix is much slower when using pd.DataFrame.loc[]. Since I have switched to saving straight to an np.ndarray, which is much faster, I just wonder why? (Please see the terminal output at the bottom of the question for a time comparison.)

n.b. As pointed out in the comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle contains a string of tags (separated by , or ,,) which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from all_cols), as they might not have the same features present once they are broken into a matrix.
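For the tag column and the train/test alignment, one common pattern is to one-hot the tags and then reindex against the shared column list; a minimal sketch with a hypothetical tags column and all_cols list:

```python
import pandas as pd

# Hypothetical tag column: comma-separated tags with stray spaces.
df = pd.DataFrame({'tags': ['red, blue', 'blue,green', 'red']})

# Strip spaces, then one-hot encode on the comma separator.
dummies = df['tags'].str.replace(' ', '').str.get_dummies(sep=',')

# Align to a shared column set: columns missing from this frame are
# added as 0, and columns not in all_cols are dropped.
all_cols = ['blue', 'green', 'red', 'yellow']
aligned = dummies.reindex(columns=all_cols, fill_value=0)
```

reindex gives both the training and test matrices the same columns in the same order, regardless of which tags actually appear in each split.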

Please see the code below for each version.

DataFrame version:

def preprocess_df(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)

    # output must be a DataFrame here, so that the .loc[] assignment below works
    x = pd.DataFrame(np.zeros((len(df), len(cols))), columns=cols)

    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))

        print "Processed: ", col,  datetime.datetime.now()

        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            x.loc[:, dummy_col] = df_col[dummy_col]

        print "Assigned: ", col,  datetime.datetime.now()

    return x.values

np version:

def preprocess_np(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)

    x = np.zeros(shape=(len(df),len(cols)))

    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))

        print "Processed: ", col,  datetime.datetime.now()

        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            idx = [i for i, j in enumerate(cols) if j == dummy_col][0]
            x[:, idx] = df_col[dummy_col].values

        print "Assigned: ", col,  datetime.datetime.now()

    return x
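
As a side note on the np version itself: the list-comprehension lookup for idx scans the whole column list for every dummy column. Building a name-to-index dict once makes each lookup O(1); a sketch of that variant (the pickle loading is omitted and cols is passed in, purely for illustration):

```python
import numpy as np
import pandas as pd

def preprocess_np_fast(df, cols):
    # one-time mapping from output column name to its index
    col_idx = {c: i for i, c in enumerate(cols)}
    x = np.zeros((len(df), len(cols)))
    for col in df.columns:
        df_col = pd.get_dummies(df[col], prefix=str(col))
        for dummy_col in df_col.columns:
            # O(1) dict lookup instead of scanning cols each time
            x[:, col_idx[dummy_col]] = df_col[dummy_col].values
    return x
```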

Timed outputs (10,000 examples):

DataFrame version:

Processed:  Weekday 
Assigned:  Weekday 0.437081  
Processed:  Hour 0.002366
Assigned:  Hour 1.33815

np version:

Processed:  Weekday   
Assigned:  Weekday 0.006992
Processed:  Hour 0.002632
Assigned:  Hour 0.008989

Is there a different approach that would optimize this further? I am interested because at the moment I am discarding a potentially useful feature, as it is too slow to add an extra 15,000 columns to the output.

Any general advice on the approach I am taking is also appreciated!

Thank you

jfive asked Apr 09 '16 13:04


1 Answer

One experiment would be to change the assignment to x.loc[:, dummy_col] = df_col[dummy_col].values. When the right-hand side is a Series, pandas aligns on the index for every assignment. Assigning an ndarray skips that alignment when it's unnecessary, which should improve performance.

Cyrus answered Nov 14 '22 23:11