I am preprocessing data for a machine learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). This is applied to a single Pandas DataFrame column and outputs a new DataFrame with the same number of rows as the original and a width equal to the number of unique categories in that column.
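For reference, the basic behaviour described above looks like this (a minimal sketch with made-up data):

```python
import pandas as pd

# A single categorical column: 4 rows, 3 unique categories
s = pd.Series(['cat', 'dog', 'cat', 'bird'], name='animal')

# One-hot encode: output keeps the same number of rows (4)
# and gains one column per unique category (3)
dummies = pd.get_dummies(s, prefix='animal')
print(dummies.shape)           # (4, 3)
print(list(dummies.columns))   # ['animal_bird', 'animal_cat', 'animal_dog']
```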
I need to complete this for a DataFrame of shape (3,000,000 x 16), which outputs a binary matrix of shape (3,000,000 x 600).
During the process, the step of converting to a binary matrix with pd.get_dummies() is very quick, but the assignment to the output matrix was much slower using pd.DataFrame.loc[]. I have since switched to saving straight into a np.ndarray, which is much faster; I just wonder why? (Please see the terminal output at the bottom of the question for a time comparison.)
N.B. As pointed out in the comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle contains a string of tags (separated by ',' or ', '), which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from all_cols), as they might not have the same features present once they are broken into a matrix.
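As a small illustration of that tag-splitting step (the sample data here is made up):

```python
import pandas as pd

df = pd.DataFrame({'tags': ['a, b', 'b,c', 'a']})

# Strip spaces, then one-hot encode the comma-separated tags
tag_dummies = df['tags'].str.replace(' ', '').str.get_dummies(sep=',')
print(tag_dummies)
#    a  b  c
# 0  1  1  0
# 1  0  1  1
# 2  1  0  0
```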
Please see code below for each version
DataFrame version:
def preprocess_df(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)
    # output is a DataFrame here; the np version below uses an ndarray instead
    x = pd.DataFrame(0, index=df.index, columns=cols)
    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))
        print "Processed: ", col, datetime.datetime.now()
        # 2. assign each column of the binary matrix to a column of the output
        for dummy_col in df_col.columns:
            x.loc[:, dummy_col] = df_col[dummy_col]
        print "Assigned: ", col, datetime.datetime.now()
    return x.values
np version:
def preprocess_np(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)
    x = np.zeros(shape=(len(df), len(cols)))
    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))
        print "Processed: ", col, datetime.datetime.now()
        # 2. assign each column of the binary matrix to a column of the output
        for dummy_col in df_col.columns:
            idx = [i for i, j in enumerate(cols) if j == dummy_col][0]
            x[:, idx] = df_col[dummy_col].values
        print "Assigned: ", col, datetime.datetime.now()
    return x
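One easy win in the np version: the linear scan over cols for every dummy column is O(n) per lookup. Precomputing a dict makes each lookup O(1). A sketch of that change (here cols is passed in directly, standing in for the pickled column list):

```python
import numpy as np
import pandas as pd

def preprocess_np_fast(df, cols):
    # Map each output column name to its position once, up front
    col_idx = {c: i for i, c in enumerate(cols)}
    x = np.zeros(shape=(len(df), len(cols)))
    for col in df.columns:
        df_col = pd.get_dummies(df[col], prefix=str(col))
        for dummy_col in df_col.columns:
            # O(1) dict lookup instead of scanning the whole column list
            x[:, col_idx[dummy_col]] = df_col[dummy_col].values
    return x
```

With 600+ output columns this removes a quadratic factor from the assignment loop.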
Timed outputs (10,000 examples)
DataFrame version:
Processed: Weekday
Assigned: Weekday 0.437081
Processed: Hour 0.002366
Assigned: Hour 1.33815
np version:
Processed: Weekday
Assigned: Weekday 0.006992
Processed: Hour 0.002632
Assigned: Hour 0.008989
Is there a different approach to further optimize this? I am interested because at the moment I am discarding a potentially useful feature, as it is too slow to process the extra 15,000 columns it would add to the output.
Any general advice on the approach I am taking is also appreciated!
Thank you
NumPy tends to be faster than Pandas up to around fifty thousand rows of data. (Between fifty thousand and five hundred thousand rows, the winner depends mostly on the type of operation Pandas and NumPy have to perform.)
NumPy is memory efficient. Pandas tends to perform better when the number of rows is 500K or more, while NumPy performs better when the number of rows is 50K or fewer. Indexing a pandas Series is very slow compared to indexing a NumPy array.
NumPy arrays are faster than Python lists for the following reason: an array is a collection of homogeneous data types stored in contiguous memory locations, whereas a Python list is a collection of heterogeneous data types stored in non-contiguous memory locations.
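A quick illustration of that indexing gap (the timings below are indicative only, not from the original post):

```python
import timeit
import numpy as np
import pandas as pd

arr = np.arange(100_000)
ser = pd.Series(arr)

# Scalar indexing: the Series lookup goes through pandas' index machinery,
# while the ndarray lookup is a direct memory access
t_np = timeit.timeit(lambda: arr[50_000], number=10_000)
t_pd = timeit.timeit(lambda: ser[50_000], number=10_000)
print(t_np, t_pd)  # the ndarray lookup is typically much faster
```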
One experiment would be to change over to x.loc[:, dummy_col] = df_col[dummy_col].values. If the input is a Series, pandas checks the order of the indices for each assignment. Assigning with an ndarray turns that off when it is unnecessary, and that should improve performance.
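To see why that index check matters, compare assigning a Series whose index is out of order with assigning its raw ndarray (a small sketch, not the original benchmark):

```python
import pandas as pd

x = pd.DataFrame({'a': [0, 0, 0]}, index=[0, 1, 2])
s = pd.Series([10, 20, 30], index=[2, 0, 1])  # deliberately out of order

# Series assignment aligns on index: values are reordered to match x's index
x.loc[:, 'b'] = s
print(x['b'].tolist())  # [20, 30, 10]

# ndarray assignment skips alignment: values land positionally
x.loc[:, 'c'] = s.values
print(x['c'].tolist())  # [10, 20, 30]
```

That alignment step is pure overhead when, as here, the dummy DataFrame is derived from the same rows in the same order as the output.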