In my Python application, I find it handy to use a dictionary of dictionaries as the source data for constructing a sparse pandas DataFrame, which I then use to train a model in sklearn.
The structure of the dictionary is like this:
data = {"X": {'a': 1, 'b': 2, 'c': 3}, "Y": {'d': 4, 'e': 5, 'f': 6}, "Z": {'g': 7, 'h': 8, 'i': 9}}
Ideally, I'd like to turn it into a DataFrame like this:
df = pandas.DataFrame.from_dict(data, orient="index").fillna(0).astype(int)
Which generates this:
   e  d  f  a  c  b  i  h  g
X  0  0  0  1  3  2  0  0  0
Y  5  4  6  0  0  0  0  0  0
Z  0  0  0  0  0  0  9  8  7
Now, here's my problem. My data has hundreds of thousands of rows (i.e., keys in the outer dictionary). Each of these has only a handful of columns associated with it (i.e., keys in each inner dictionary), but the total number of columns is in the thousands. I've found DataFrame generation using from_dict to be very slow: on the order of 2.5-3 minutes for 200,000 rows and 6,000 columns.
Furthermore, in the case where the row index is a MultiIndex (i.e., instead of X, Y and Z the keys of the outer dictionary are tuples), from_dict is even slower, on the order of 7+ minutes for 200,000 rows. I've found that this overhead can be avoided by using a list of dictionaries instead of a dictionary of dictionaries, and then adding the MultiIndex back to the resulting DataFrame with set_index.
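Roughly, that list-of-dictionaries workaround looks like this (a quick sketch, assuming the outer keys are tuples such as ('X', 1); the variable names are just illustrative):

import pandas
keys = list(data.keys())     # outer keys, i.e. the tuples for the MultiIndex
rows = list(data.values())   # the inner dictionaries, in the same order
df = pandas.DataFrame(rows).fillna(0).astype(int)
df.index = pandas.MultiIndex.from_tuples(keys)  # or add key columns and call set_index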
In summary, how would you suggest I deal with this? Performance with the MultiIndex can clearly be improved by the library developers, but am I using the wrong tool for the job here? If written to disk, the DataFrame is around 2.5GB in size. Reading a 2.5GB file from disk in around 2 minutes seems about right, but the sparsity of my data in memory should, in theory, allow this to be much faster.
It turns out that sklearn has a class that does exactly what I need.
sklearn.feature_extraction.DictVectorizer
I generate the data as a list of dictionaries, setting the row labels aside. And then:
import numpy, pandas, sklearn.feature_extraction

# data is the list of inner dictionaries; row_labels holds the row keys set aside earlier
vectorizer = sklearn.feature_extraction.DictVectorizer(dtype=numpy.uint8,
                                                       sparse=False)
matrix = vectorizer.fit_transform(data)
column_labels = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
df = pandas.DataFrame(matrix, index=row_labels, columns=column_labels)
That finishes within a minute or so, which is fast enough for me. Maybe somebody can improve it further.
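For reference, here is how the list of dictionaries and row_labels above can be produced from the original dict of dicts (a minimal sketch on the toy example; nested is just my name for the dict of dicts):

nested = {"X": {'a': 1, 'b': 2, 'c': 3}, "Y": {'d': 4, 'e': 5, 'f': 6}, "Z": {'g': 7, 'h': 8, 'i': 9}}
row_labels = list(nested.keys())    # outer keys, kept aside for the index
data = list(nested.values())        # inner dicts, fed to fit_transform above

If memory is tight, DictVectorizer(sparse=True) returns a scipy sparse matrix instead; recent pandas versions can wrap it with pandas.DataFrame.sparse.from_spmatrix, though I haven't benchmarked that path.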
My advice is to use sparse matrices and replace letters with numeric (row / column) identifiers.
Below is a benchmark against your minimal example.
import pandas as pd, numpy as np
from scipy.sparse import coo_matrix

def original(data):
    df = pd.DataFrame.from_dict(data, orient="index").fillna(0).astype(int)
    return df

def jp(data):
    # flatten to {(row_id, col_id): value}; ord() works here only because the
    # keys are single characters -- with real keys, factorize the keys directly
    res = {(ord(k), ord(i)): j for k, v in data.items() for i, j in v.items()}
    rows = np.array(pd.factorize(list(zip(*res.keys()))[0])[0])
    cols = np.array(pd.factorize(list(zip(*res.keys()))[1])[0])
    values = np.array(list(res.values()))
    return pd.DataFrame(coo_matrix((values, (rows, cols)),
                                   shape=(len(np.unique(rows)), len(np.unique(cols)))).toarray())
%timeit original(data) # 1.45 ms
%timeit jp(data) # 488 µs
If you wish, rename your index / columns as a separate step (a sketch follows the result below). I haven't tested the renaming, but my instinct is the method will still be reasonably fast with this step.
Result
   0  1  2  3  4  5  6  7  8
0  1  2  3  0  0  0  0  0  0
1  0  0  0  4  5  6  0  0  0
2  0  0  0  0  0  0  7  8  9
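Here is a rough, untested sketch of that renaming step (my own variant of jp, reusing the imports above; it keeps the uniques returned by pd.factorize so the original labels can be restored):

def jp_labelled(data):
    # same flattening as jp, but keep the original keys instead of ord() codes
    res = {(k, i): j for k, v in data.items() for i, j in v.items()}
    row_codes, row_labels = pd.factorize(list(zip(*res.keys()))[0])
    col_codes, col_labels = pd.factorize(list(zip(*res.keys()))[1])
    values = np.array(list(res.values()))
    dense = coo_matrix((values, (row_codes, col_codes)),
                       shape=(len(row_labels), len(col_labels))).toarray()
    return pd.DataFrame(dense, index=row_labels, columns=col_labels)

On the minimal example this should give back the X / Y / Z index and the a ... i columns.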