 

Running out of memory when saving a list with numpy

Tags: python, numpy

I have a fairly large list of lists representing the tokens in the Sogou text classification data set. I can process the entire training set of 450,000 documents with 12 GB of RAM to spare, but when I call numpy.save() on the list of lists, memory usage appears to double and I run out of memory.

Why is this? Does numpy.save convert the list to an array before saving while also retaining the original list, thus doubling the memory usage?

Is there an alternative way to save this list of lists, e.g. pickling? I believe numpy.save uses the pickle protocol, judging from the allow_pickle argument: https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html

print "Collecting Raw Documents, tokenize, and remove stop words"
df = pd.read_pickle(path + dataSetName + "Train")
frequency = defaultdict(int)

gen_docs = []
totalArts = len(df)
for artNum in range(totalArts):
    if artNum % 2500 == 0:
        print "Gen Docs Creation on " + str(artNum) + " of " + str(totalArts)
    bodyText = df.loc[artNum,"fullContent"]
    bodyText = re.sub('<[^<]+?>', '', str(bodyText))
    bodyText = re.sub(pun, " ", str(bodyText))
    tmpDoc = []
    for w in word_tokenize(bodyText):
        w = w.lower().decode("utf-8", errors="ignore")
        #if w not in STOPWORDS and len(w) > 1:
        if len(w) > 1:
            #w = wordnet_lemmatizer.lemmatize(w)
            w = re.sub(num, "number", w)
            tmpDoc.append(w)
            frequency[w] += 1
    gen_docs.append(tmpDoc)
print len(gen_docs)

del df
print "Saving unfiltered gen"
dataSetName = path + dataSetName
np.save("%s_lemmaWords_noStop_subbedNums.npy" % dataSetName, gen_docs)
asked Oct 31 '25 by Kevinj22


1 Answer

np.save first tries to convert the input into an array. After all, it is designed to save numpy arrays.

If the resulting array is multidimensional with a numeric or string dtype, it saves some basic dimension information plus a memory copy of the array's data buffer.

But if the array contains other objects (i.e. object dtype), then it pickles those objects and saves the resulting byte string(s).
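
A toy illustration of the rectangular case, where the saved data buffer is a genuine copy of every element (names and file name are illustrative):

import numpy as np

uniform = [["some", "tokens"], ["more", "words"]]  # all sublists the same length

arr = np.array(uniform)      # copies every string into one contiguous buffer
print(arr.shape, arr.dtype)  # (2, 2) <U6

np.save("uniform.npy", arr)  # header + raw copy of the data buffer, no pickle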

I would try

arr = np.array(gen_docs)

Does that produce a memory error?

If not, what is its shape and dtype?

If the tmpDoc sublists vary in length, arr will be a 1d array with object dtype, those objects being the tmpDoc lists.

Only if all the tmpDoc lists have the same length will it produce a 2d array. Even then, the dtype will depend on the elements, whether numbers, strings, or other objects.
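
A quick check of that distinction with toy data (note that recent NumPy versions require dtype=object to be passed explicitly for ragged input, where older versions inferred it):

import numpy as np

ragged = [["a", "b"], ["c", "d", "e"]]  # sublists of different lengths

arr = np.array(ragged, dtype=object)    # 1d array whose elements are the sublists
print(arr.shape, arr.dtype)             # (2,) object
print(arr[1])                           # ['c', 'd', 'e']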

I might add that an object array like this is itself pickled by np.save.
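
Which means loading such a file requires opting in to pickle: since NumPy 1.16.3, np.load refuses to unpickle object arrays unless allow_pickle=True is passed. A round-trip sketch (file name illustrative):

import numpy as np

docs = np.array([["a", "b"], ["c", "d", "e"]], dtype=object)

np.save("docs.npy", docs)                        # object dtype, so contents are pickled
loaded = np.load("docs.npy", allow_pickle=True)  # must opt in to unpickling
print(loaded[1])                                 # ['c', 'd', 'e']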

answered Nov 02 '25 by hpaulj