I am processing a large file of records with 2,000,000 rows. Each line contains features about emails and a binary label [0, 1] for non-spam or spam respectively.
I want to convert categorical features such as email_type, which takes on values from [1, 10], to a binary matrix. This can be accomplished with pd.get_dummies(), which creates a binary indicator matrix from a column of features.
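For example, on a toy column (values made up) the conversion looks roughly like this:

import pandas as pd

# toy example: email_type holds integer codes; get_dummies expands it into
# one indicator column per observed value
df = pd.DataFrame({'email_type': [1, 3, 1, 10]})
dummies = pd.get_dummies(df['email_type'], prefix='email_type')
print(dummies.columns.tolist())  # ['email_type_1', 'email_type_3', 'email_type_10']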
This works perfectly on a small subsample of the data, say 10,000 rows. However, for 100,000+ rows I see the error Killed: 9.
To tackle this, I have tried the following steps:

1. Break the DataFrame into chunks with numpy.array_split()
2. Convert each chunk to a binary matrix via preprocess_data() (which calls pd.get_dummies())
3. Concatenate the processed chunks back together
Code:
import datetime

import numpy as np
import pandas as pd

# break into chunks of roughly 10,000 rows
chunks = (len(df) // 10000) + 1
print('chunks', chunks)
df_list = np.array_split(df, chunks)

super_x = []
super_y = []

# loop through chunks
for i, df_chunk in enumerate(df_list):
    print('chunk', i, datetime.datetime.now())
    # preprocess_data() returns x, y (both DataFrames)
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
    super_y.append(y)

# vertically concatenate DataFrames
super_x_mat = pd.concat(super_x, axis=0).fillna(0)
super_y_mat = pd.concat(super_y, axis=0)

# pickle (in case of further preprocessing)
super_x_mat.to_pickle('super_x_mat.p')
super_y_mat.to_pickle('super_y_mat.p')

# return values as np.ndarray
x = super_x_mat.values
y = super_y_mat.values
return [x, y]
Some example output:
chunks 13
chunk 0 2016-04-08 12:46:55.473963
chunk 1 2016-04-08 12:47:05.942743
...
chunk 12 2016-04-08 12:49:16.318680
Killed: 9
Step 2 (conversion to a binary matrix) runs out of memory after processing 32 blocks (320,000 rows); however, the out-of-memory error could also occur as each chunk is appended to a list of DataFrames, i.e. df_chunks.append(df).

Step 3 runs out of memory trying to concatenate 20 successfully processed blocks (200,000 rows).
The ideal output is a numpy.ndarray that I can feed to a sklearn logistic regression classifier.
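For context, the intended downstream use is roughly this (a sketch, assuming x and y are the dense arrays returned above):

from sklearn.linear_model import LogisticRegression

# x: (n_samples, n_features) indicator matrix, y: (n_samples,) labels
clf = LogisticRegression()
clf.fit(x, y.ravel())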
What other approaches can I try? I am starting to work with machine learning on datasets of this size more regularly, so I would love advice on general approaches and am open to suggestions like the following:
get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.
The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
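A quick sketch of that idea, using a hypothetical low-cardinality text column (not from the original data):

import pandas as pd

# a hypothetical low-cardinality text column
s = pd.Series(['gmail', 'yahoo', 'gmail', 'outlook'] * 250000)
print(s.memory_usage(deep=True))                     # object dtype: every string stored separately
print(s.astype('category').memory_usage(deep=True))  # category dtype: small integer codes + one lookup table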
Though Numba can speed up numpy code, it does not speed up code involving pandas, which is the most commonly used data-manipulation library and is built on top of numpy.
Pandas inspects the feature values to decide each column's data type and loads the data into RAM accordingly. A value stored as int8 takes one eighth of the memory of an int64 value.
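For example, a sketch of downcasting the email_type column (values 1-10) before encoding:

import numpy as np
import pandas as pd

df = pd.DataFrame({'email_type': np.random.randint(1, 11, size=1000000)})

# cast explicitly to int8 (safe here because the values fit in -128..127) ...
df['email_type'] = df['email_type'].astype(np.int8)

# ... or let pandas pick the smallest safe integer type
df['email_type'] = pd.to_numeric(df['email_type'], downcast='integer')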
If you are doing something like one-hot encoding, or in any case are going to have lots of zeros, have you considered using sparse matrices? This should be done after the pre-processing e.g.:
from scipy import sparse

[x, y] = preprocess_data(df_chunk)
x = sparse.csr_matrix(x.values)
super_x.append(x)
pandas also has a sparse type:

[x, y] = preprocess_data(df_chunk)
x = x.to_sparse()
super_x.append(x)
One note: since you are cutting and joining by row, csr is preferable to csc.
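A rough sketch of how the per-chunk CSR matrices could then be stacked and used, assuming every chunk produces the same set of dummy columns (otherwise the columns must be aligned first); sklearn's LogisticRegression accepts CSR input directly:

import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# stack the per-chunk CSR matrices row-wise instead of pd.concat
super_x_mat = sparse.vstack(super_x, format='csr')
super_y_mat = pd.concat(super_y, axis=0)

clf = LogisticRegression()
clf.fit(super_x_mat, super_y_mat.values.ravel())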