
Repeated insertions into sqlite database via sqlalchemy causing memory leak?

When inserting a huge pandas DataFrame into sqlite via sqlalchemy and pandas' to_sql with a specified chunksize, I would get memory errors.

At first I thought it was an issue with to_sql, but I tried a workaround where, instead of using chunksize, I used for i in range(100): df.iloc[i * 100000:(i+1) * 100000].to_sql(...), and that still resulted in an error.
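Roughly, a sketch of the two approaches (with a small stand-in DataFrame and illustrative table names; the real data is far larger):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')
df = pd.DataFrame({'a': range(1000000), 'b': range(1000000)})  # stand-in for the real data

# Approach 1: let to_sql chunk the inserts itself.
df.to_sql('test_table', engine, index=False, if_exists='append', chunksize=100000)

# Approach 2: slice the DataFrame manually and append chunk by chunk.
for i in range(10):
    df.iloc[i * 100000:(i + 1) * 100000].to_sql(
        'test_table', engine, index=False, if_exists='append')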

It seems that, under certain conditions, there is a memory leak with repeated insertions to sqlite via sqlalchemy.

I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close.

import string
import numpy as np
import pandas as pd
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    # Build one flat array of random characters, then view it as fixed-width
    # unicode strings of length `size`.
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    # A DataFrame with `num_rows` rows and two columns of random strings.
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    return pd.DataFrame(data)

dfAll = alt(randint(1000, 2000), 10000)

# Grow the DataFrame to roughly 3.3 million rows of 1000-2000 character strings.
for i in range(330):
    print('step ', i)
    df = alt(randint(1000, 2000), 10000)
    dfAll = pd.concat([df, dfAll])

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

# Repeatedly append 10000-row slices of dfAll to the sqlite database.
for i in range(500):
    print('step', i)
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', engine, index=False, if_exists='append')

This was run in the Google Colab CPU environment.

The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql, or a single to_sql with chunksize specified.
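One way to watch the growth directly is to log the process's resident memory after each insert, for example with psutil (this snippet is only an illustration, reusing dfAll from the code above; psutil comes preinstalled on Colab):

import os
import psutil
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')
proc = psutil.Process(os.getpid())

for i in range(500):
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', engine, index=False, if_exists='append')
    # Resident set size of this Python process after each chunk is written.
    print('step', i, 'RSS: {:.0f} MB'.format(proc.memory_info().rss / 1024 ** 2))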

Is there a way that this code could be run without causing an eventual increase in memory usage?

Edit:

To fully reproduce the error, run this notebook

https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS

The notebook requires you to import this folder into the main directory of your Google Drive

https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8

The notebook will also mount your Google Drive, so you need to give it authorization to access your Google Drive. Since the data is hosted on my Google Drive, importing it should not take up any of your allocated storage.

asked Jun 02 '19 by SantoshGupta7


1 Answer

The Google Colab instance starts with about 12.72GB of RAM available. After creating the DataFrame, theBigList, about 9.99GB of RAM have been used. Already this is a rather uncomfortable situation to be in, since it is not unusual for Pandas operations to require as much additional space as the DataFrame it is operating on. So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time without ever creating theBigList (see below).

However, if we use the code you posted, we can see that the RAM usage slowly increases as chunks of theBigList are stored in the database iteratively.

The theBigList DataFrame stores the strings in a NumPy array. But in the process of transferring the strings to the sqlite database, the NumPy strings are converted into Python strings, which takes additional memory.
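A quick way to see the two representations side by side (the sizes here are only illustrative):

import sys
import numpy as np

# A fixed-width NumPy unicode array keeps all of its characters in one
# contiguous buffer, 4 bytes per character.
arr = np.array(['x' * 1500, 'y' * 1500])     # dtype '<U1500'
print(arr.dtype, arr.nbytes)

# Handing an element to the database driver materializes a separate Python
# str object on the heap, in addition to the NumPy buffer.
s = arr[0].item()
print(type(s), sys.getsizeof(s))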

Per this Theano tutorial, which discusses Python's internal memory management:

To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size: there will be a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.

... The important point is that those lists never shrink.

Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are created.

If small objects memory is never freed, then the inescapable conclusion is that, like goldfishes, these small object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.

I believe this accurately describes the behavior you are seeing as this loop executes:

for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')

Even though many dead objects' locations are being reused for new strings, it is not implausible that, with essentially random strings such as those in theBigList, extra space will occasionally be needed, and so the memory footprint keeps growing.

The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.


In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:

import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))
    for i in range(0, 47):
        print('step {}: {}'.format(i, mc.report_memory()))
        for letter in list('ABCDEF'):
            # Load one .npy chunk at a time instead of building theBigList.
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            toPD = pd.DataFrame(comb).drop([0, 2, 3], axis=1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)

which prints

step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...         
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551

The last number on each line is the amount of memory consumed by the process as reported by matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting the size of the physical pages of the core image of the process (including text, data, and stack space).
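In newer matplotlib releases, report_memory has been deprecated and may no longer be available; if so, a roughly comparable number (resident set size, in KB) can be obtained with psutil, for example:

import os
import psutil

def report_memory_kb():
    # Resident set size of the current process, in KB, roughly comparable
    # on Linux to the numbers printed above.
    return psutil.Process(os.getpid()).memory_info().rss // 1024

print(report_memory_kb())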


By the way, another basic trick you can use to manage memory is to use functions. Local variables inside a function are deallocated when the function terminates. This relieves you of the burden of manually calling del and gc.collect().
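For example, a toy sketch of that pattern (the CSV path is hypothetical, just to show the scoping):

import pandas as pd
import sqlalchemy as SA

def store_one_chunk(path, engine):
    # `chunk` exists only inside this function; when the function returns,
    # the local reference disappears and the DataFrame can be reclaimed
    # without an explicit del or gc.collect().
    chunk = pd.read_csv(path)
    chunk.to_sql('CS_table', engine, index=False, if_exists='append')

engine = SA.create_engine('sqlite:///example.db')
# store_one_chunk('part00.csv', engine)   # hypothetical chunk file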

answered Sep 26 '22 by unutbu