I am preprocessing data for a machine learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). This is applied to a single Pandas DataFrame column and outputs a new DataFrame with the same number of rows as the original and a width equal to the number of unique categories in that column.
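For reference, the basic behaviour described above looks like this (a minimal sketch with made-up data):

```python
import pandas as pd

# A single categorical column: 4 rows, 3 unique categories
s = pd.Series(['cat', 'dog', 'cat', 'bird'], name='animal')

# One-hot encode: output keeps the same number of rows (4)
# and gains one column per unique category (3)
dummies = pd.get_dummies(s, prefix='animal')
print(dummies.shape)           # (4, 3)
print(list(dummies.columns))   # ['animal_bird', 'animal_cat', 'animal_dog']
```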
I need to complete this for a DataFrame of shape (3,000,000 x 16), which outputs a binary matrix of shape (3,000,000 x 600).
During the process, the step of converting to a binary matrix with pd.get_dummies() is very quick, but the assignment to the output matrix was much slower using pd.DataFrame.loc[]. I have since switched to saving straight into a np.ndarray, which is much faster; I just wonder why? (Please see the terminal output at the bottom of the question for a time comparison.)
N.B. As pointed out in the comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle contains a string of tags (separated by ',' or ', '), which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from all_cols), as they might not have the same features present once they are broken into a matrix.
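As a small illustration of that tag-splitting step (the sample data here is made up):

```python
import pandas as pd

df = pd.DataFrame({'tags': ['a, b', 'b,c', 'a']})

# Strip spaces, then one-hot encode the comma-separated tags
tag_dummies = df['tags'].str.replace(' ', '').str.get_dummies(sep=',')
print(tag_dummies)
#    a  b  c
# 0  1  1  0
# 1  0  1  1
# 2  1  0  0
```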
Please see code below for each version
DataFrame version:
def preprocess_df(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)
    # output is a DataFrame here; the np version below uses an ndarray instead
    x = pd.DataFrame(0, index=df.index, columns=cols)
    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))
        print "Processed: ", col, datetime.datetime.now()
        # 2. assign each column of the binary matrix to a column of the output
        for dummy_col in df_col.columns:
            x.loc[:, dummy_col] = df_col[dummy_col]
        print "Assigned: ", col, datetime.datetime.now()
    return x.values
np version:
def preprocess_np(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)
    x = np.zeros(shape=(len(df), len(cols)))
    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))
        print "Processed: ", col, datetime.datetime.now()
        # 2. assign each column of the binary matrix to a column of the output
        for dummy_col in df_col.columns:
            idx = [i for i, j in enumerate(cols) if j == dummy_col][0]
            x[:, idx] = df_col[dummy_col].values
        print "Assigned: ", col, datetime.datetime.now()
    return x
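One easy win in the np version: the linear scan over cols for every dummy column is O(n) per lookup. Precomputing a dict makes each lookup O(1). A sketch of that change (here cols is passed in directly, standing in for the pickled column list):

```python
import numpy as np
import pandas as pd

def preprocess_np_fast(df, cols):
    # Map each output column name to its position once, up front
    col_idx = {c: i for i, c in enumerate(cols)}
    x = np.zeros(shape=(len(df), len(cols)))
    for col in df.columns:
        df_col = pd.get_dummies(df[col], prefix=str(col))
        for dummy_col in df_col.columns:
            # O(1) dict lookup instead of scanning the whole column list
            x[:, col_idx[dummy_col]] = df_col[dummy_col].values
    return x
```

With 600+ output columns this removes a quadratic factor from the assignment loop.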
Timed outputs (10,000 examples)
DataFrame version:
Processed: Weekday
Assigned: Weekday 0.437081
Processed: Hour 0.002366
Assigned: Hour 1.33815
np version:
Processed: Weekday
Assigned: Weekday 0.006992
Processed: Hour 0.002632
Assigned: Hour 0.008989
Is there a different approach to further optimize this? I am interested because at the moment I am discarding a potentially useful feature, as it is too slow to process the extra 15,000 columns it would add to the output.
Any general advice on the approach I am taking is also appreciated!
Thank you
NumPy tends to be faster than Pandas up to around fifty thousand rows of data. (Between fifty thousand and five hundred thousand rows, the winner depends mostly on the type of operation Pandas and NumPy have to perform.)
NumPy is memory efficient. Pandas tends to perform better when the number of rows is 500K or more, while NumPy performs better when the number of rows is 50K or fewer. Indexing a pandas Series is very slow compared to indexing a NumPy array.
NumPy arrays are faster than Python lists for the following reason: an array is a collection of homogeneous data types stored in contiguous memory locations, whereas a Python list is a collection of heterogeneous data types stored in non-contiguous memory locations.
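A quick illustration of that indexing gap (the timings below are indicative only, not from the original post):

```python
import timeit
import numpy as np
import pandas as pd

arr = np.arange(100_000)
ser = pd.Series(arr)

# Scalar indexing: the Series lookup goes through pandas' index machinery,
# while the ndarray lookup is a direct memory access
t_np = timeit.timeit(lambda: arr[50_000], number=10_000)
t_pd = timeit.timeit(lambda: ser[50_000], number=10_000)
print(t_np, t_pd)  # the ndarray lookup is typically much faster
```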
One experiment would be to change over to x.loc[:, dummy_col] = df_col[dummy_col].values. If the input is a Series, pandas checks the order of the indices for each assignment. Assigning with an ndarray turns that off when it is unnecessary, and that should improve performance.
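To see why that index check matters, compare assigning a Series whose index is out of order with assigning its raw ndarray (a small sketch, not the original benchmark):

```python
import pandas as pd

x = pd.DataFrame({'a': [0, 0, 0]}, index=[0, 1, 2])
s = pd.Series([10, 20, 30], index=[2, 0, 1])  # deliberately out of order

# Series assignment aligns on index: values are reordered to match x's index
x.loc[:, 'b'] = s
print(x['b'].tolist())  # [20, 30, 10]

# ndarray assignment skips alignment: values land positionally
x.loc[:, 'c'] = s.values
print(x['c'].tolist())  # [10, 20, 30]
```

That alignment step is pure overhead when, as here, the dummy DataFrame is derived from the same rows in the same order as the output.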