While profiling the memory consumption of my algorithm, I was surprised that smaller inputs sometimes need more memory.
It all boils down to the following usage of pandas.unique():
import numpy as np
import pandas as pd
import sys
N=int(sys.argv[1])
a=np.arange(N, dtype=np.int64)
b=pd.unique(a)
With N=6*10^7 it needs 3.7 GB peak memory, but with N=8*10^7 "only" 3 GB.
Scanning different input sizes yields the following graph:
Out of curiosity and for self-education: how can the counterintuitive behavior (i.e. more memory for a smaller input size) around N=5*10^7 and N=1.3*10^7 be explained?
Here are the scripts for producing the memory consumption graph on Linux:
pandas_unique_test.py:
import numpy as np
import pandas as pd
import sys
N=int(sys.argv[1])
a=np.arange(N, dtype=np.int64)
b=pd.unique(a)
show_memory.py:
import sys
import matplotlib.pyplot as plt
ns=[]
mems=[]
for line in sys.stdin.readlines():
    n, mem = map(int, line.strip().split(" "))
    ns.append(n)
    mems.append(mem)
plt.plot(ns, mems, label='peak-memory')
plt.xlabel('n')
plt.ylabel('peak memory in KB')
ymin, ymax = plt.ylim()
plt.ylim(0,ymax)
plt.legend()
plt.show()
run_perf_test.sh:
WRAPPER="/usr/bin/time -f%M" #peak memory in Kb
N=1000000
while [ $N -lt 100000000 ]
do
    printf "$N "
    $WRAPPER python pandas_unique_test.py $N
    N=`expr $N + 1000000`
done
And now:
sh run_perf_test.sh 2>&1 | python show_memory.py
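For a quick sanity check of the numbers reported by /usr/bin/time, the peak RSS can also be read from inside the process. This is just a minimal sketch (not part of the scripts above); on Linux, resource's ru_maxrss is reported in KB, matching the %M output:
import resource
import sys
import numpy as np
import pandas as pd

N = int(sys.argv[1])
a = np.arange(N, dtype=np.int64)
b = pd.unique(a)

# On Linux, ru_maxrss is the peak resident set size in kilobytes.
print(N, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)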
Let's see...
pandas.unique says it's a "hash-table based unique". It calls an internal helper to acquire the correct hash table implementation for your data, namely htable.Int64HashTable.
The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.
Those functions are defined (templated) in khash.h.
It seems to allocate (size_hint >> 5) * 4 + (size_hint) * 8 * 2 bytes of memory for the buckets (maybe more, maybe less; I might be off here).
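To put rough numbers on that (very rough) estimate, here are the two N values from the question plugged into it; khash_bucket_estimate is just my name for the guess above, not a pandas or khash function:
def khash_bucket_estimate(size_hint):
    # Rough guess from above: flag area plus 8-byte keys and values per bucket.
    return (size_hint >> 5) * 4 + size_hint * 8 * 2

for n in (6 * 10**7, 8 * 10**7):
    print(n, khash_bucket_estimate(n) / 2**30, "GiB")
That comes out to roughly 0.9 GiB for N=6*10^7 and 1.2 GiB for N=8*10^7, so under this estimate the table itself grows monotonically with N; the jumps have to come from somewhere else.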
Then, HashTable.unique() is called.
It allocates an empty Int64Vector, which seems to quadruple its size whenever it gets filled, starting from 128.
It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)
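Conceptually, the loop looks roughly like this (plain-Python illustration only, not pandas' actual Cython code):
def hashtable_unique(values):
    seen = set()      # plays the role of the Int64HashTable
    uniques = []      # plays the role of the growing Int64Vector
    for v in values:
        if v not in seen:
            seen.add(v)
            uniques.append(v)
    return uniques    # pandas instead wraps the vector's buffer in an ndarray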
Finally, a NumPy ndarray is made to point at the vector.
So uh, I think you're seeing the vector size quadrupling at certain thresholds, which should be (if my late-night math stands):
>>> [2 ** (2 * i - 1) for i in range(4, 20)]
[128, 512, 2048, 8192, 32768, 131072, 524288, 2097152, 8388608,
 33554432, 134217728, 536870912, 2147483648, 8589934592,
 34359738368, 137438953472]
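Assuming that growth rule is right, the capacity the vector ends up with for a given N is the first of those thresholds that is >= N. A quick sketch of the quadrupling rule described above (again, not pandas code):
def final_vector_capacity(n, start=128, factor=4):
    # Capacity of a vector that starts at `start` and multiplies by `factor`
    # each time it fills up, after n unique values have been appended.
    cap = start
    while cap < n:
        cap *= factor
    return cap

for n in (13_000_000, 50_000_000, 60_000_000, 80_000_000):
    print(n, final_vector_capacity(n))
At 8 bytes per int64, jumping from one capacity to the next adds a large block of memory in one go, which is presumably the kind of step your graph shows.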
Hope this sheds some light on things :)