
Curious memory consumption of pandas.unique()

While profiling the memory consumption of my algorithm, I was surprised that sometimes for smaller inputs more memory was needed.

It all boils down to the following usage of pandas.unique():

import numpy as np
import pandas as pd
import sys

N=int(sys.argv[1])

a=np.arange(N, dtype=np.int64)
b=pd.unique(a)

With N=6*10^7 it needs 3.7 GB of peak memory, but with N=8*10^7 "only" 3 GB.

Scanning different input-size yields the following graph:

[Graph: peak memory consumption vs. input size N]

Out of curiosity and for self-education: how can the counterintuitive behavior (i.e. more memory for a smaller input size) around N=5*10^7 and N=1.3*10^7 be explained?


Here are the scripts for producing the memory consumption graph on Linux:

pandas_unique_test.py:

import numpy as np
import pandas as pd
import sys

N=int(sys.argv[1])    
a=np.arange(N, dtype=np.int64)
b=pd.unique(a)

show_memory.py:

import sys
import matplotlib.pyplot as plt   
ns=[]
mems=[]
for line in sys.stdin.readlines():
    n,mem = map(int, line.strip().split(" "))
    ns.append(n)
    mems.append(mem)
plt.plot(ns, mems, label='peak-memory')
plt.xlabel('n')
plt.ylabel('peak memory in KB')
ymin, ymax = plt.ylim()
plt.ylim(0,ymax)
plt.legend()
plt.show()

run_perf_test.sh:

WRAPPER="/usr/bin/time -f%M" # peak memory in KB
N=1000000
while [ $N -lt 100000000 ]
do
   printf "$N "
   $WRAPPER python pandas_unique_test.py $N
   N=`expr $N + 1000000`
done 

And now:

sh run_perf_test.sh 2>&1 | python show_memory.py
asked Jul 23 '18 by ead
1 Answer

Let's see...

pandas.unique says it's a "hash-table based unique".

It calls this function to acquire the correct hash table implementation for your data, namely htable.Int64HashTable.

The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.

Those functions are defined (templated) in khash.h.

It seems to allocate (size_hint >> 5) * 4 + (size_hint) * 8 * 2 bytes of memory for the buckets (maybe more, maybe less, I might be off here).
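To get a feel for those numbers, here's a rough sketch of that allocation. It assumes (hedged: this is my reading of khash, not a verified trace) that khash rounds the requested bucket count up to the next power of two via kroundup32, and then restates the per-bucket estimate above, which may be off in the same ways:

```python
# Rough, assumption-laden estimate of khash's bucket allocation for
# an Int64HashTable. Assumptions: bucket count is rounded up to the
# next power of two (kroundup32), plus packed flag words, plus an
# 8-byte key and an 8-byte value per bucket.

def next_pow2(x):
    """Smallest power of two >= x."""
    p = 1
    while p < x:
        p *= 2
    return p

def approx_khash_bytes(size_hint):
    n_buckets = next_pow2(size_hint)      # assumed rounding behavior
    flags = (n_buckets >> 5) * 4          # packed flag bits, per the formula above
    keys_and_vals = n_buckets * 8 * 2     # int64 keys + int64 values
    return flags + keys_and_vals

for n in (6 * 10**7, 8 * 10**7):
    print(n, approx_khash_bytes(n) / 2**30, "GiB")
```

Under these assumptions, N=6*10^7 rounds up to 2^26 buckets and N=8*10^7 to 2^27, so the table itself roughly doubles between those two inputs; the table alone grows monotonically in steps, and the non-monotonic peaks come from how it combines with the vector below.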

Then, HashTable.unique() is called.

It allocates an empty Int64Vector, which seems to quadruple in size whenever it fills up, starting from 128.

It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)

Finally, a NumPy ndarray is made to point at the vector.
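Put together, the whole procedure can be sketched in pure Python. A set stands in for the khash table and a list for the growing Int64Vector; this is an illustration of the algorithm, not the actual Cython implementation:

```python
import numpy as np

def unique_sketch(values):
    """Pure-Python sketch of a hash-table-based unique:
    preserves first-seen order, as pandas.unique does."""
    seen = set()   # stands in for the khash table
    out = []       # stands in for the growing Int64Vector
    for v in values:
        if v not in seen:
            seen.add(v)       # remember the value in the "hash table"
            out.append(v)     # and append it to the "vector"
    return np.asarray(out)    # final ndarray over the collected values

print(unique_sketch(np.array([3, 1, 3, 2, 1])))
```

The memory story lives entirely in the two containers: the "table" is pre-sized from the length hint, while the "vector" grows on demand as new uniques appear.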

So uh, I think you're seeing the vector size quadrupling at certain thresholds (which should be, if my late-night math stands,

>>> [2 ** (2 * i - 1) for i in range(4, 20)]
[
    128,
    512,
    2048,
    8192,
    32768,
    131072,
    524288,
    2097152,
    8388608,
    33554432,
    134217728,
    536870912,
    2147483648,
    8589934592,
    34359738368,
    137438953472,
    ...,
]
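As a sanity check on those thresholds, a tiny sketch (assuming the vector really does start at capacity 128 and quadruples when full):

```python
# Simulate the assumed Int64Vector growth policy: start at 128,
# quadruple whenever the current capacity is exceeded.
def vector_capacity(n_unique):
    cap = 128
    while cap < n_unique:
        cap *= 4
    return cap

# Capacities are 128 * 4**k, i.e. the odd powers of two listed above.
for n in (33_554_432, 33_554_433):   # just at / just past 2**25
    print(n, vector_capacity(n))
```

Crossing one of the listed values by a single element quadruples the final capacity: vector_capacity(33_554_433) is 134217728 while vector_capacity(33_554_432) is still 33554432, which is exactly the kind of jump the graph shows.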

Hope this sheds some light on things :)

answered Sep 30 '22 by AKX