I swear I saw this discussion somewhere a while ago, but I cannot find it anywhere anymore.
Imagine I have this method:
import numpy as np
import pandas as pd

def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df[df['val'] == 1]
Some time ago I decided not to do this, because the method could return a view (this is not a certainty; it depends on what pandas wants to do) instead of a new dataframe.
The issue with this, I read, is that if a view is returned, the refcount on the original dataframe is not reduced, because the view is still referencing that old dataframe even though we are only using a small portion of the data.
I was advised to instead do the following:
def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df.drop(df[df["val"] != 1].index)
In this case, the drop method creates a new dataframe containing only the data we want to keep, and as soon as the method finishes the refcount on the original dataframe drops to zero, making it eligible for garbage collection and eventually freeing the memory.
In summary, this would be much more memory-friendly and would also ensure that the result of the method is a dataframe and not a view of a dataframe, which can lead to the SettingWithCopyWarning we all love.
Is this still true? Or is it something I misread somewhere? I have tried to check whether this has any benefit on memory usage, but given that I cannot control when the gc decides to remove things from memory (I can only ask it to collect), I never seem to get conclusive results.
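For reference, the kind of comparison I have been trying looks roughly like the sketch below, written with tracemalloc so the peak numbers do not depend on when the gc actually runs; the function names are just labels for the two versions above, and the printed figures will differ per machine.

import tracemalloc

import numpy as np
import pandas as pd

def by_boolean_mask():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df[df['val'] == 1]

def by_drop():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df.drop(df[df['val'] != 1].index)

for fn in (by_boolean_mask, by_drop):
    tracemalloc.start()
    result = fn()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # "current" is what is still allocated once the method has returned,
    # "peak" is the high-water mark while the full 1,000,000-row frame existed
    print(fn.__name__, len(result), current, peak)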
Changing numeric columns to a smaller dtype: instead, we can downcast the data types, for example converting int64 values to int8 and float64 values to float32. This reduces memory usage.
Pandas decides a column's data type by observing the feature values and then loads it into RAM. A value stored as int8 takes 8 times less memory than one stored as int64.
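As a rough sketch of what this downcasting looks like in practice (the ratio column is made up just to have a float next to the question's val column; pd.to_numeric with downcast picks the smallest integer dtype that still fits the values):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'val': np.random.randint(0, 1000, 1000000),   # int64 by default
    'ratio': np.random.rand(1000000),             # float64 by default
})
print(df.memory_usage(deep=True).sum())           # baseline, in bytes

# downcast to the smallest dtype that can still hold the values
df['val'] = pd.to_numeric(df['val'], downcast='integer')   # 0..999 fits in int16
df['ratio'] = df['ratio'].astype('float32')

print(df.dtypes)
print(df.memory_usage(deep=True).sum())           # noticeably smaller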
The upper limit for a pandas DataFrame in that case was the roughly 100 GB of free disk space on the machine: when your Mac needs memory, it will push something that isn't currently being used into a swap file for temporary storage, and when it needs access again it will read the data from the swap file back into memory.
You can always use the df.query() method; by passing inplace=True you set the result on the original dataframe and don't need to create a copy.
Code:

def my_method_3(df):
    # with inplace=True, query filters df in place and returns None,
    # so return the now-filtered frame itself instead of the call's result
    df.query('val == 1', inplace=True)
    return df

my_method_3(df)
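A quick usage sketch (same column name and size as in the question) to show that it is the original dataframe object that gets filtered, not a copy of it:

import numpy as np
import pandas as pd

def my_method_3(df):
    df.query('val == 1', inplace=True)
    return df

df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
out = my_method_3(df)

# the original object itself was filtered; no second dataframe was built
assert out is df
print(len(df))          # only the rows where val == 1 remain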
Also the method:
def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df.drop(df[df["val"] != 1].index)
might not be very efficient for large datasets. I benchmarked this method and saw the following:

CPU times: user 327 ms, sys: 51.4 ms, total: 379 ms
Wall time: 394 ms

whereas, in contrast, the df.query method took:

CPU times: user 14.3 ms, sys: 7.39 ms, total: 21.7 ms
Wall time: 18.6 ms
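Those figures will of course vary per machine; a rough way to reproduce the comparison outside a notebook is with the timeit module (a sketch only; query is used without inplace=True here so that every repetition starts from the same full frame):

import timeit

setup = """
import numpy as np
import pandas as pd
df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
"""

# the drop-based version first builds an intermediate frame of the rows to throw away
drop_time = timeit.timeit("df.drop(df[df['val'] != 1].index)", setup=setup, number=10)

# query builds the result directly from the boolean condition
query_time = timeit.timeit("df.query('val == 1')", setup=setup, number=10)

print(f"drop : {drop_time / 10:.4f} s per call")
print(f"query: {query_time / 10:.4f} s per call")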