I am processing a large file of records with 2,000,000 rows. Each line contains features about emails and a binary label [0, 1] for non-spam or spam respectively.
I want to convert categorical features such as email_type, which takes on values from [1, 10], to a binary matrix. This can be accomplished with pd.get_dummies(), which creates a binary indicator matrix from a column of features.
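For example, on a toy column (values made up) the conversion looks roughly like this:

import pandas as pd

# toy example: email_type holds integer codes; get_dummies expands it into
# one indicator column per observed value
df = pd.DataFrame({'email_type': [1, 3, 1, 10]})
dummies = pd.get_dummies(df['email_type'], prefix='email_type')
print(dummies.columns.tolist())  # ['email_type_1', 'email_type_3', 'email_type_10']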
This works perfectly on a small subsample of the data, say 10,000 rows. However, for 100,000+ rows I see the error Killed: 9.
To tackle this, I have tried the following steps:

1. Break the DataFrame into chunks with numpy.array_split()
2. Convert each chunk to a binary matrix via preprocess_data() (which calls pd.get_dummies())
3. Concatenate the processed chunks back together
Code:
import datetime

import numpy as np
import pandas as pd

# break into chunks of roughly 10,000 rows
chunks = (len(df) // 10000) + 1
print('chunks', chunks)
df_list = np.array_split(df, chunks)

super_x = []
super_y = []

# loop through chunks
for i, df_chunk in enumerate(df_list):
    print('chunk', i, datetime.datetime.now())
    # preprocess_data() returns x, y (both DataFrames)
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
    super_y.append(y)

# vertically concatenate DataFrames
super_x_mat = pd.concat(super_x, axis=0).fillna(0)
super_y_mat = pd.concat(super_y, axis=0)

# pickle (in case of further preprocessing)
super_x_mat.to_pickle('super_x_mat.p')
super_y_mat.to_pickle('super_y_mat.p')

# return values as np.ndarray
x = super_x_mat.values
y = super_y_mat.values
return [x, y]
Some example output:
chunks 13
chunk 0 2016-04-08 12:46:55.473963
chunk 1 2016-04-08 12:47:05.942743
...
chunk 12 2016-04-08 12:49:16.318680
Killed: 9
Step 2 (conversion to a binary matrix) runs out of memory after processing 32 blocks (320,000 rows); however, the out-of-memory error could also occur as each chunk is appended to a list of DataFrames, i.e. df_chunks.append(df).

Step 3 runs out of memory trying to concatenate 20 successfully processed blocks (200,000 rows).
The ideal output is a numpy.ndarray that I can feed to a sklearn logistic regression classifier.
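For context, the intended downstream use is roughly this (a sketch, assuming x and y are the dense arrays returned above):

from sklearn.linear_model import LogisticRegression

# x: (n_samples, n_features) indicator matrix, y: (n_samples,) labels
clf = LogisticRegression()
clf.fit(x, y.ravel())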
What other approaches can I try? I am starting to work with machine learning on datasets of this size more regularly, so I would love advice on general approaches and am open to suggestions like the following:
get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.
The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
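A quick sketch of that idea, using a hypothetical low-cardinality text column (not from the original data):

import pandas as pd

# a hypothetical low-cardinality text column
s = pd.Series(['gmail', 'yahoo', 'gmail', 'outlook'] * 250000)
print(s.memory_usage(deep=True))                     # object dtype: every string stored separately
print(s.astype('category').memory_usage(deep=True))  # category dtype: small integer codes + one lookup table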
Though Numba can speed up numpy code, it does not speed up code involving pandas, which is the most commonly used data-manipulation library and is built on top of numpy.
Pandas inspects the feature values to decide each column's data type and loads the data into RAM accordingly. A value stored as int8 takes one eighth of the memory of an int64 value.
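For example, a sketch of downcasting the email_type column (values 1-10) before encoding:

import numpy as np
import pandas as pd

df = pd.DataFrame({'email_type': np.random.randint(1, 11, size=1000000)})

# cast explicitly to int8 (safe here because the values fit in -128..127) ...
df['email_type'] = df['email_type'].astype(np.int8)

# ... or let pandas pick the smallest safe integer type
df['email_type'] = pd.to_numeric(df['email_type'], downcast='integer')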
If you are doing something like one-hot encoding, or in any case are going to have lots of zeros, have you considered using sparse matrices? This should be done after the pre-processing e.g.:
from scipy import sparse

[x, y] = preprocess_data(df_chunk)
x = sparse.csr_matrix(x.values)
super_x.append(x)
pandas also has a sparse type:

[x, y] = preprocess_data(df_chunk)
x = x.to_sparse()
super_x.append(x)
One note: since you are cutting and joining by row, csr is preferable to csc.
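A rough sketch of how the per-chunk CSR matrices could then be stacked and used, assuming every chunk produces the same set of dummy columns (otherwise the columns must be aligned first); sklearn's LogisticRegression accepts CSR input directly:

import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# stack the per-chunk CSR matrices row-wise instead of pd.concat
super_x_mat = sparse.vstack(super_x, format='csr')
super_y_mat = pd.concat(super_y, axis=0)

clf = LogisticRegression()
clf.fit(super_x_mat, super_y_mat.values.ravel())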