I am wondering why pandas uses so much memory when reindexing a Series.
I create a simple dataset:
import numpy as np
import pandas as pd

a = pd.Series(np.arange(5e7, dtype=np.double))
According to top on my Ubuntu machine, the whole session uses about 820 MB.
Now if I slice this to extract the first 100 elements:
a_sliced = a[:100]
This shows no increased memory consumption.
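A quick way to confirm this (a sketch I did separately, not part of the session above; it uses a smaller array, and np.shares_memory is the only addition) is to check that the slice is a view onto the original buffer, so no data is copied:

```python
import numpy as np
import pandas as pd

# smaller than the 5e7 elements above; the effect is the same
a = pd.Series(np.arange(1e6, dtype=np.double))
a_sliced = a[:100]

# The slice is a view onto the parent Series' buffer: no data is copied
print(np.shares_memory(a.to_numpy(), a_sliced.to_numpy()))
```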
If I instead reindex a over the same range:
a_reindexed = a.reindex(np.arange(100))
Memory consumption jumps to about 1.8 GB. I also tried to clean up with gc.collect(), without success.
I would like to know whether this is expected, and whether there is a workaround to reindex large datasets without significant memory overhead.
I am using a very recent snapshot of pandas from github.
Index uses a hashtable to map labels to locations. You can check this via Series.index._engine.mapping. This mapping is created lazily, only when it is first needed. If the index is_monotonic, you can use asof() instead:
import numpy as np
import pandas as pd

idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

print(a.index._engine.mapping)  # None
print(a.reindex(new_index))
print(a.index._engine.mapping)  # <pandas.hashtable.PyObjectHashTable object at ...>

a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
print(a.asof(new_index))
print(a.index._engine.mapping)  # None
If you want more control over how non-existent labels are handled, you can use searchsorted() and do the logic yourself:
>>> a.index[a.index.searchsorted(new_index)]
Index(['0000003', '0000020', '0000030'], dtype='object')
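For example, to emulate an exact reindex without building the hashtable, you can compare the labels that searchsorted() finds against the requested ones and fill the misses with NaN (a sketch; the data matches the example above, but the NaN-filling logic and variable names are my own choice):

```python
import numpy as np
import pandas as pd

idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

# searchsorted gives insertion points; a label exists only if the index
# value at that position equals the label itself
pos = a.index.searchsorted(new_index)
pos = np.minimum(pos, len(a.index) - 1)  # guard against out-of-range positions
exists = a.index[pos] == new_index

# build the reindexed result by hand: matched labels get the value, others NaN
result = pd.Series(np.where(exists, a.to_numpy()[pos], np.nan), index=new_index)
print(result)
```

Here "0000003" and "0000020" resolve to their values, while "000002a" (which is not in the index) becomes NaN, mimicking what reindex() would return.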