
Python Pandas: Convert 2,000,000 DataFrame rows to Binary Matrix (pd.get_dummies()) without memory error?

I am processing a large file of records with 2,000,000 rows. Each line contains features about emails and a binary label [0,1] for non-spam or spam respectively.

I want to convert all features such as email_type which takes on values from [1,10] to a binary matrix.

This can be accomplished using pd.get_dummies(), which creates a binary matrix from a column of features.
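As a small illustration of what pd.get_dummies() produces, here is a sketch on a toy frame. The data and the "email_type" prefix are made up for the example; they are not taken from the real file. Note that a column is created only for values that actually occur in the input, which is why separate chunks can end up with different columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the email data (values are invented for illustration).
df = pd.DataFrame({"email_type": [1, 3, 3, 10], "label": [0, 1, 0, 1]})

# One indicator column per distinct value that appears in email_type.
# dtype=np.uint8 keeps each cell at one byte.
dummies = pd.get_dummies(df["email_type"], prefix="email_type", dtype=np.uint8)

print(dummies.columns.tolist())
print(dummies.to_numpy())
```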

This works perfectly on a small subsample of the data, say 10,000 rows. However, for 100,000+ rows, the process dies with the error Killed: 9.

To tackle this, I have tried the following:


  1. Split the DataFrame into chunks of 10,000 rows using numpy.array_split()
  2. Create a binary matrix for each DataFrame of 10,000 rows
  3. Append these to a list of DataFrames
  4. Concatenate these DataFrames together (I am doing this to preserve the difference in columns that each block will contain)


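from datetime import datetime

import numpy as np
import pandas as pd

# break into chunks of roughly 10,000 rows (integer division keeps chunks an int)
chunks = (len(df) // 10000) + 1
print("chunks", chunks)
df_list = np.array_split(df, chunks)
super_x = []
super_y = []

# loop through chunks
for i, df_chunk in enumerate(df_list):
    print("chunk", i, datetime.now())
    # preprocess_data() returns x,y (both DataFrames)
    [x, y] = preprocess_data(df_chunk)
    # collect each chunk's result; without this the lists stay empty
    super_x.append(x)
    super_y.append(y)

# vertically concatenate DataFrames; fillna(0) fills columns a chunk did not have
super_x_mat = pd.concat(super_x, axis=0).fillna(0)
super_y_mat = pd.concat(super_y, axis=0)

# pickle (in case of further preprocessing)

# return values as np.ndarray (this snippet is the body of a function)
x = super_x_mat.values
y = super_y_mat.values
return [x, y]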

Some example output:

chunks 13
chunk 0 2016-04-08 12:46:55.473963
chunk 1 2016-04-08 12:47:05.942743
chunk 12 2016-04-08 12:49:16.318680
Killed: 9

Step 2 (conversion to a binary matrix) runs out of memory after processing 32 blocks (320,000 rows). However, the failure could also happen when the chunk is appended to the list of DataFrames, i.e. at df_chunks.append(df).

Step 3 runs out of memory when trying to concatenate 20 successfully processed blocks (200,000 rows).

The ideal output is a numpy.ndarray that I can feed to an sklearn Logistic Regression classifier.

What other approaches can I try? I am starting to approach machine learning on datasets this size more regularly.

I'm after advice and open to suggestions like:

  1. Processing each chunk using all possible columns from the entire DataFrame, and saving each result to a file before re-combining
  2. Suggestions for file-based data storage
  3. Completely different approaches using other matrix types
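A minimal sketch of the first idea, assuming the full set of email_type values (1 through 10) is known up front. Declaring the column as a categorical with fixed categories makes pd.get_dummies() emit the same columns for every chunk, so the later concatenation does not need fillna(0). The helper name encode_chunk and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: email_type only ever takes the integer values 1..10.
EMAIL_TYPES = list(range(1, 11))
cat_type = pd.CategoricalDtype(categories=EMAIL_TYPES)


def encode_chunk(chunk):
    """One-hot encode email_type with the same 10 columns for every chunk."""
    fixed = chunk["email_type"].astype(cat_type)
    return pd.get_dummies(fixed, prefix="email_type", dtype=np.uint8)


# Toy data split into two "chunks" that see different values.
df = pd.DataFrame({"email_type": [1, 3, 3, 10, 7, 2]})
chunk_a = encode_chunk(df.iloc[:3])
chunk_b = encode_chunk(df.iloc[3:])

# Columns line up, so concatenation introduces no NaN values.
combined = pd.concat([chunk_a, chunk_b], axis=0)
print(combined.shape)
```

Each encoded chunk could then be written to disk before re-combining; the storage format is left open here.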
jfive asked Apr 08 '16 12:04


1 Answer

If you are doing something like one-hot encoding, or will otherwise end up with lots of zeros, have you considered using sparse matrices? The conversion should be done after the pre-processing, e.g.:

from scipy import sparse

[x, y] = preprocess_data(df_chunk)
x = sparse.csr_matrix(x.values)

pandas also has a sparse type:

[x, y] = preprocess_data(df_chunk)
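For reference, one way to get sparse columns in current pandas versions is the sparse=True option of pd.get_dummies(). This is a sketch on toy data, not necessarily the call the snippet above originally contained; the frame and prefix are invented, and the to_coo() step requires SciPy to be installed.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration).
df = pd.DataFrame({"email_type": [1, 3, 3, 10]})

# sparse=True stores each indicator column as a pandas sparse array,
# so only the non-zero entries take up memory.
dummies = pd.get_dummies(
    df["email_type"], prefix="email_type", sparse=True, dtype=np.uint8
)
print(dummies.dtypes)

# The .sparse accessor converts the frame to a SciPy matrix without densifying.
x = dummies.sparse.to_coo().tocsr()
print(x.shape, x.nnz)
```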

One note: since you are cutting and joining by row, csr is preferable to csc.
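A sketch of the row-wise workflow with CSR matrices, under the assumption that every chunk is encoded with the same fixed set of columns (here email_type values 1 through 10, enforced with a categorical dtype). The toy data and the chunk size of 2 are only for illustration; preprocess_data() from the question is replaced by a plain get_dummies() call.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumption: email_type only takes the values 1..10.
cat_type = pd.CategoricalDtype(categories=list(range(1, 11)))

# Toy data (invented for illustration).
df = pd.DataFrame({"email_type": [1, 3, 3, 10, 7, 2]})

sparse_chunks = []
chunk_size = 2  # tiny on purpose; the question uses 10,000
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    dense = pd.get_dummies(chunk["email_type"].astype(cat_type), dtype=np.uint8)
    # Only the small dense chunk exists in memory at any time;
    # what is kept is the compact CSR version.
    sparse_chunks.append(sparse.csr_matrix(dense.to_numpy()))

# Stack the row blocks into one CSR matrix.
x = sparse.vstack(sparse_chunks, format="csr")
print(x.shape, x.nnz)
```

scikit-learn's LogisticRegression accepts SciPy sparse matrices as input to fit(), so the stacked matrix does not have to be converted to a dense numpy.ndarray.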

ntg answered Oct 21 '22 13:10
