When inserting a huge pandas DataFrame into sqlite via sqlalchemy and pandas to_sql with a specified chunksize, I would get memory errors.
At first I thought it was an issue with to_sql, but I tried a workaround where instead of using chunksize I used `for i in range(100): df.iloc[i*100000:(i+1)*100000].to_sql(...)`, and that still resulted in an error.
It seems that, under certain conditions, there is a memory leak with repeated insertions to sqlite via sqlalchemy.
I had a hard time replicating, with a minimal example, the memory leak that occurred when converting my data, but this gets pretty close:
```python
import string
import numpy as np
import pandas as pd
from random import randint
import random

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
              .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)
for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([df, dfAll])

import sqlalchemy
from sqlalchemy import create_engine

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

for i in range(500):
    print('step', i)
    dfAll.iloc[(i%330)*10000:((i%330)+1)*10000].to_sql('test_table22', engine, index=False, if_exists='append')
```
This was run in a Google Colab CPU environment.
The database itself isn't causing the memory leak, because I can restart my environment and the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql, or a single to_sql call with chunksize specified.
Is there a way that this code could be run without causing an eventual increase in memory usage?
Edit:
To fully reproduce the error, run this notebook:
https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS
The notebook requires you to import this folder into the main directory of your Google Drive:
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
The notebook will also mount your Google Drive, so you need to give it authorization to access your Google Drive. Since the data is hosted on my Google Drive, importing the data should not take up any of your allocated storage.
The Google Colab instance starts with about 12.72GB of RAM available.
After creating the DataFrame theBigList, about 9.99GB of RAM has been used. Already this is a rather uncomfortable situation to be in, since it is not unusual for Pandas operations to require as much additional space as the DataFrame they operate on. So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time, without ever creating theBigList (see below).
However, if we use the code you posted, we can see that the RAM usage slowly increases as chunks of theBigList are stored in the database iteratively.
The theBigList DataFrame stores the strings in a NumPy array. But in the process of transferring the strings to the sqlite database, the NumPy strings are converted into Python strings, and this takes additional memory.
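As a rough illustration of that conversion cost (this snippet is not from the original post, and the exact numbers depend on your Python and NumPy builds), the same short strings take noticeably more memory as individual Python str objects than packed into a single NumPy unicode array:

```python
import sys
import numpy as np

# 100,000 ten-character strings in one contiguous NumPy buffer:
arr = np.array(['A' * 10] * 100_000)             # dtype '<U10'
print(arr.nbytes)                                # 4 bytes per character -> 4,000,000 bytes

# The same data as separate Python str objects, which is what to_sql must produce:
as_python = arr.tolist()
print(sum(sys.getsizeof(s) for s in as_python))  # each str also carries ~49 bytes of object overhead
```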
Per this Theano tutorial, which discusses Python's internal memory management:
To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size: there will be a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.
... The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are created.
If small objects memory is never freed, then the inescapable conclusion is that, like goldfishes, these small object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.
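To see this effect directly, here is a minimal sketch (not part of the original post) that assumes a Linux system such as Colab, where the current resident set size can be read from /proc. It allocates a few million small random strings, deletes them, and then checks how much memory was actually returned to the operating system (often not all of it):

```python
import gc
import random
import string

def rss_kb():
    # Current resident set size in kB, read from /proc (Linux only).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

print('before      :', rss_kb())
strings = [''.join(random.choices(string.ascii_uppercase, k=10)) for _ in range(2_000_000)]
print('allocated   :', rss_kb())

del strings
gc.collect()
print('after delete:', rss_kb())  # the freed memory is generally not all handed back to the OS
```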
I believe this accurately describes the behavior you are seeing as this loop executes:
```python
for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')
```
Even though many dead objects' locations are being reused for new strings, it is not implausible, with essentially random strings such as those in theBigList, that extra space will occasionally be needed, and so the memory footprint keeps growing.
The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:
```python
import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))
    for i in range(0, 47):
        print('step {}: {}'.format(i, mc.report_memory()))
        for letter in list('ABCDEF'):
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            toPD = pd.DataFrame(comb).drop([0, 2, 3], 1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)
```
which prints
```
step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551
```
The last number on each line is the amount of memory consumed by the process as reported by matplotlib.cbook.report_memory. There are a number of different measures of memory usage; on Linux, mc.report_memory() reports the size of the physical pages of the core image of the process (including text, data, and stack space).
By the way, another basic trick you can use to manage memory is to use functions. Local variables inside a function are deallocated when the function terminates, which relieves you of the burden of manually calling del and gc.collect().
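For instance, here is a hypothetical sketch of this pattern (the file names are made up, not taken from the original notebook): the intermediate DataFrame only exists for the duration of the call, so its memory can be reused for the next chunk without any explicit cleanup.

```python
import pandas as pd
import sqlalchemy as SA

def insert_one_chunk(path, engine):
    # df becomes unreachable as soon as the function returns,
    # so its memory can be reused while processing the next chunk.
    df = pd.read_csv(path)
    df.to_sql('CS_table', engine, index=False, if_exists='append')

engine = SA.create_engine('sqlite:///example.db')
for path in ['chunk00.csv', 'chunk01.csv']:  # hypothetical chunk files
    insert_one_chunk(path, engine)
```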