I have a piece of code which receives callbacks from another function and builds a list of lists (pd_arr). This list is then used to create a DataFrame, and finally the list of lists is deleted.
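For context, the code is shaped roughly like this (a simplified, hypothetical sketch: init, add_to_list, and df_columns stand in for the real callback wiring and column names):

import pandas as pd

pd_arr = []
df_columns = ['id', 'val']              # placeholder column names

def add_to_list(row):                   # the callback: accumulates one row per call
    pd_arr.append(row)

def init():                             # placeholder for the real source driving the callbacks
    for i in range(0, 1000):
        add_to_list([i, i * 2])

init()
pd_df = pd.DataFrame(pd_arr, columns=df_columns)
pd_df = pd_df.set_index(df_columns[0])
del pd_arr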
On profiling it with memory-profiler, this is the output:
102.632812 MiB   0.000000 MiB       init()
236.765625 MiB 134.132812 MiB       add_to_list()
                                return pd.DataFrame()
394.328125 MiB 157.562500 MiB   pd_df = pd.DataFrame(pd_arr, columns=df_columns)
350.121094 MiB -44.207031 MiB   pd_df = pd_df.set_index(df_columns[0])
350.292969 MiB   0.171875 MiB   pd_df.memory_usage()
350.328125 MiB   0.035156 MiB   print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0]), sys.getsizeof(pd_df), len(pd_arr)
350.328125 MiB   0.000000 MiB   del pd_arr
Checking the deep memory usage of pd_df (the DataFrame) gives 80.5 MB. So my first question is: why does the memory not decrease after the del pd_arr line?
Also, the total DataFrame size according to the profiler (157 - 44 ≈ 113 MB) seems to be larger than the 80 MB measured on the frame itself. What causes the difference?
Also, is there a more memory-efficient way to create the DataFrame (the data is received in a loop) that is not too bad in time performance (e.g., an increase of tens of seconds would be fine for a DataFrame of around 100 MB)?
Edit: Here is a simple Python script which demonstrates this behaviour.
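The script itself is not reproduced below, so here is a minimal reconstruction inferred from the Line Contents column of the profiles (the value of size, the imports, and the driver calls at the bottom are assumptions; in the first run below, del arr[:] was not commented out and dfCreate was not called):

import gc
import pandas as pd
from memory_profiler import profile

size = 1000000  # assumption: the actual value is not shown in the profiles

@profile
def setup():
    global arr, size
    arr = range(1, size)
    arr = [x+1 for x in arr]

@profile
def dfCreate():
    global arr
    pd_df = pd.DataFrame(arr)
    return pd_df

@profile
def tearDown():
    global arr
    #del arr[:]
    del arr
    gc.collect()

if __name__ == '__main__':
    setup()
    pd_df = dfCreate()
    tearDown()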
Filename: py_test.py
Line # Mem usage Increment Line Contents
================================================
9 102.0 MiB 0.0 MiB @profile
10 def setup():
11 global arr, size
12 102.0 MiB 0.0 MiB arr = range(1, size)
13 131.2 MiB 29.1 MiB arr = [x+1 for x in arr]
Filename: py_test.py
Line # Mem usage Increment Line Contents
================================================
21 131.2 MiB 0.0 MiB @profile
22 def tearDown():
23 global arr
24 131.2 MiB 0.0 MiB del arr[:]
25 131.2 MiB 0.0 MiB del arr
26 93.7 MiB -37.4 MiB gc.collect()
After introducing the DataFrame:
Filename: py_test.py
Line # Mem usage Increment Line Contents
================================================
9 102.0 MiB 0.0 MiB @profile
10 def setup():
11 global arr, size
12 102.0 MiB 0.0 MiB arr = range(1, size)
13 132.7 MiB 30.7 MiB arr = [x+1 for x in arr]
Filename: py_test.py
Line # Mem usage Increment Line Contents
================================================
15 132.7 MiB 0.0 MiB @profile
16 def dfCreate():
17 global arr
18 147.1 MiB 14.4 MiB pd_df = pd.DataFrame(arr)
19 147.1 MiB 0.0 MiB return pd_df
Filename: py_test.py
Line # Mem usage Increment Line Contents
================================================
21 147.1 MiB 0.0 MiB @profile
22 def tearDown():
23 global arr
24 #del arr[:]
25 147.1 MiB 0.0 MiB del arr
26 147.1 MiB 0.0 MiB gc.collect()
Answering your first question: when you try to free the memory with del pd_arr, it actually isn't freed, because the DataFrame keeps one link to pd_arr and the top scope keeps one more; decrementing the reference counter won't collect the memory, because that memory is still in use. You can check this assumption by running sys.getrefcount(pd_arr) before del pd_arr: you will get 2 as a result.
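As a small standalone illustration of how those reference counts behave (the names here are made up):

import sys

a = [1, 2, 3]
print sys.getrefcount(a)  # 2: the name `a` plus the temporary reference held by the call itself

b = a                     # a second binding to the same list object
print sys.getrefcount(a)  # 3

del a                     # drops one reference; the list survives through `b`
print sys.getrefcount(b)  # back to 2, and the memory stays allocated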
Now, I believe the following code snippet does the same as what you're trying to do: https://gist.github.com/vladignatyev/ec7a26b7042efd6f710d436afbfb87de/90df8cc6bbb8bd0cb3a1d2670e03aff24f3a5b24
If you try this snippet, you will see the memory usage as follows:
Line # Mem usage Increment Line Contents
================================================
13 63.902 MiB 0.000 MiB @profile
14 def to_profile():
15 324.828 MiB 260.926 MiB pd_arr = make_list()
16 # pd_df = pd.DataFrame.from_records(pd_arr, columns=[x for x in range(0,1000)])
17 479.094 MiB 154.266 MiB pd_df = pd.DataFrame(pd_arr)
18 # pd_df.info(memory_usage='deep')
19 479.094 MiB 0.000 MiB print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
20 481.055 MiB 1.961 MiB print sys.getsizeof(pd_df), len(pd_arr)
21 481.055 MiB 0.000 MiB print sys.getrefcount(pd_arr)
22 417.090 MiB -63.965 MiB del pd_arr
23 323.090 MiB -94.000 MiB gc.collect()
Try this example (gc.collect() is included to match the profile below):

import gc
from memory_profiler import profile

@profile
def test():
    a = [x for x in range(0,100000)]
    # print sys.getrefcount(a)
    del a
    gc.collect()

aa = test()
You will get exactly what you expect:
Line # Mem usage Increment Line Contents
================================================
6 64.117 MiB 0.000 MiB @profile
7 def test():
8 65.270 MiB 1.152 MiB a = [x for x in range(0,100000)]
9 # print sys.getrefcount(a)
10 64.133 MiB -1.137 MiB del a
11 64.133 MiB 0.000 MiB gc.collect()
Also, if you call sys.getrefcount(a), sometimes the memory will even be cleaned before del a:
Line # Mem usage Increment Line Contents
================================================
6 63.828 MiB 0.000 MiB @profile
7 def test():
8 65.297 MiB 1.469 MiB a = [x for x in range(0,100000)]
9 64.230 MiB -1.066 MiB print sys.getrefcount(a)
10 64.160 MiB -0.070 MiB del a
But things go wild when you use pandas.
If you open the source code of pandas.DataFrame, you will see that when you initialize a DataFrame with a list, pandas creates a new NumPy array and copies its contents. Check this out: https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L329
Deleting pd_arr explicitly doesn't free the memory any earlier, because pd_arr would be collected after the DataFrame creation and on exiting your function anyway, since nothing else keeps a link to it. The getrefcount calls before and after prove this.
Creating a new DataFrame from a plain list makes your list get copied into a NumPy array (look at np.array(data, dtype=dtype, copy=copy) and the corresponding documentation for numpy.array). The copy also affects execution time, because allocating a new memory block is a heavy operation.
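A small sketch of that distinction (this assumes NumPy's default copy semantics; the sizes are arbitrary):

import numpy as np

lst = [[x for x in range(0, 1000)] for i in range(0, 100)]

from_list = np.array(lst)                   # a list is always copied into a fresh buffer
from_arr = np.array(from_list, copy=False)  # an existing ndarray can be reused as-is

print from_arr is from_list                 # True: no new allocation happened

The list path pays for the full copy on every call, while the ndarray path is essentially free.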
I've tried initializing a new DataFrame from a NumPy array instead. The only difference is where the NumPy array's memory overhead appears. Compare the following two snippets:
def make_list():  # 1
    pd_arr = []
    for i in range(0,10000):
        pd_arr.append([x for x in range(0,1000)])
    return np.array(pd_arr)
and
def make_list():  # 2
    pd_arr = []
    for i in range(0,10000):
        pd_arr.append([x for x in range(0,1000)])
    return pd_arr
Snippet #1 (creating the DataFrame produces no memory-usage overhead):
Line # Mem usage Increment Line Contents
================================================
14 63.672 MiB 0.000 MiB @profile
15 def to_profile():
16 385.309 MiB 321.637 MiB pd_arr = make_list()
17 385.309 MiB 0.000 MiB print sys.getrefcount(pd_arr)
18 385.316 MiB 0.008 MiB pd_df = pd.DataFrame(pd_arr)
19 385.316 MiB 0.000 MiB print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
20 386.934 MiB 1.617 MiB print sys.getsizeof(pd_df), len(pd_arr)
21 386.934 MiB 0.000 MiB print sys.getrefcount(pd_arr)
22 386.934 MiB 0.000 MiB del pd_arr
23 305.934 MiB -81.000 MiB gc.collect()
Snippet #2 (about 154 MiB of overhead due to copying the array):
Line # Mem usage Increment Line Contents
================================================
14 63.652 MiB 0.000 MiB @profile
15 def to_profile():
16 325.352 MiB 261.699 MiB pd_arr = make_list()
17 325.352 MiB 0.000 MiB print sys.getrefcount(pd_arr)
18 479.633 MiB 154.281 MiB pd_df = pd.DataFrame(pd_arr)
19 479.633 MiB 0.000 MiB print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
20 481.602 MiB 1.969 MiB print sys.getsizeof(pd_df), len(pd_arr)
21 481.602 MiB 0.000 MiB print sys.getrefcount(pd_arr)
22 417.621 MiB -63.980 MiB del pd_arr
23 330.621 MiB -87.000 MiB gc.collect()
So, initialize your DataFrame only from a NumPy array, not from a plain list. It is better from the memory-consumption perspective and probably faster, because it doesn't require the additional allocation-and-copy step.
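As a practical sketch of that advice for your case, where rows arrive in a loop (the shape, dtype, and fill values below are hypothetical placeholders for whatever your callback delivers):

import numpy as np
import pandas as pd

n_rows, n_cols = 10000, 1000               # hypothetical sizes; adjust to your data

# Preallocate the buffer once, then fill it row by row as the callbacks arrive.
arr = np.empty((n_rows, n_cols), dtype=np.int64)
for i in range(0, n_rows):
    arr[i] = range(0, n_cols)              # stand-in for the row your callback delivers

# pandas can reuse a compatible ndarray directly, so no second full copy is made.
pd_df = pd.DataFrame(arr)

The one-time np.empty allocation replaces both the growing list of lists and the copy inside the DataFrame constructor, which is where the bulk of the 100+ MiB overhead came from.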
Hopefully, now I've answered all of your questions.