I have a fairly large list of lists representing the tokens in the Sogou text classification data set. I can process the entire training set of 450,000 articles with 12 GB of RAM left over, but when I call numpy.save() on the list of lists, memory usage seems to double and I run out of memory.
Why is this? Does numpy.save convert the list to an array before saving while still retaining the original list, thus using more memory?
Is there an alternative way to save this list of lists, e.g. pickling? I believe numpy.save uses the pickle protocol, judging from the allow_pickle argument: https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
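import re
from collections import defaultdict

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize  # assumed: word_tokenize is NLTK's tokenizer

# path, dataSetName, pun and num are defined elsewhere in the script (not shown).
print "Collecting Raw Documents, tokenize, and remove stop words"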
df = pd.read_pickle(path + dataSetName + "Train")
frequency = defaultdict(int)
gen_docs = []
totalArts = len(df)
for artNum in range(totalArts):
    if artNum % 2500 == 0:
        print "Gen Docs Creation on " + str(artNum) + " of " + str(totalArts)
    bodyText = df.loc[artNum,"fullContent"]
    bodyText = re.sub('<[^<]+?>', '', str(bodyText))
    bodyText = re.sub(pun, " ", str(bodyText))
    tmpDoc = []
    for w in word_tokenize(bodyText):
        w = w.lower().decode("utf-8", errors="ignore")
        #if w not in STOPWORDS and len(w) > 1:
        if len(w) > 1:
            #w = wordnet_lemmatizer.lemmatize(w)
            w = re.sub(num, "number", w)
            tmpDoc.append(w)
            frequency[w] += 1
    gen_docs.append(tmpDoc)
print len(gen_docs)
del df
print "Saving unfiltered gen"
dataSetName = path + dataSetName
np.save("%s_lemmaWords_noStop_subbedNums.npy" % dataSetName, gen_docs)
np.save first tries to convert its input into an array; after all, it is designed to save numpy arrays.
If the resulting array is multidimensional with a numeric or string dtype, it saves some basic dimension information plus a copy of the array's data buffer.
But if the array contains other objects (i.e. dtype object), then it pickles those objects and saves the resulting string(s).
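To make the two storage paths described above concrete, here is a minimal sketch. It assumes a NumPy version where np.load defaults to allow_pickle=False, and it writes to in-memory buffers rather than the asker's files; the variable names are illustrative only.

```python
import io

import numpy as np

# Numeric data: the file holds a small header (shape, dtype) plus the raw
# data buffer, so loading it back needs no pickle support.
numeric = np.array([[1, 2, 3], [4, 5, 6]])
buf_numeric = io.BytesIO()
np.save(buf_numeric, numeric)
buf_numeric.seek(0)
numeric_back = np.load(buf_numeric)

# Token lists of different lengths: these only fit in an object array,
# so np.save has to pickle the elements, and loading must allow pickle.
ragged = np.array([["a", "bb"], ["ccc"]], dtype=object)
buf_ragged = io.BytesIO()
np.save(buf_ragged, ragged, allow_pickle=True)
buf_ragged.seek(0)
ragged_back = np.load(buf_ragged, allow_pickle=True)
```

The second case is the one gen_docs most likely falls into, since documents rarely contain the same number of tokens.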
I would try
arr = np.array(gen_docs)
Does that produce a memory error?
If not, what is its shape and dtype?
If the tmpDoc sublists vary in length, arr will be a 1d array with object dtype, its elements being the tmpDoc lists.
Only if all the tmpDoc sublists have the same length will it produce a 2d array. Even then, the dtype will depend on the elements, whether numbers, strings, or other objects.
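A small sketch of the shape and dtype check suggested above, using made-up token lists in place of gen_docs. One assumption to flag: recent NumPy releases refuse to build an array from sublists of unequal length unless dtype=object is passed explicitly, so the sketch passes it; older releases produced the object array on their own.

```python
import numpy as np

# Hypothetical stand-ins for gen_docs.
ragged_docs = [["the", "cat"], ["sat"]]          # sublists differ in length
equal_docs = [["the", "cat"], ["sat", "down"]]   # sublists all have length 2

# Ragged input: a 1d object array whose elements are the original lists.
ragged_arr = np.array(ragged_docs, dtype=object)

# Equal-length input: a genuine 2d array with a string dtype.
equal_arr = np.array(equal_docs)

ragged_info = (ragged_arr.shape, ragged_arr.dtype)
equal_info = (equal_arr.shape, equal_arr.dtype.kind)
```

Inspecting ragged_info and equal_info shows which of the two cases a given list of lists falls into before attempting to save it.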
I might add that an array is pickled with the save protocol.
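On the question of an alternative, one option is to pickle the list of lists directly, which skips the conversion to an array. The following is only a sketch: save_docs, load_docs and the file name are hypothetical, and whether this actually lowers peak memory for a data set of this size has not been measured here.

```python
import os
import pickle
import tempfile


def save_docs(docs, filename):
    """Pickle a list of token lists straight to disk, without building an array."""
    with open(filename, "wb") as fh:
        pickle.dump(docs, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_docs(filename):
    """Load a list of token lists written by save_docs."""
    with open(filename, "rb") as fh:
        return pickle.load(fh)


# Tiny stand-in for gen_docs, written to a temporary directory.
sample_docs = [["the", "cat"], ["sat"]]
demo_path = os.path.join(tempfile.mkdtemp(), "gen_docs_demo.pkl")
save_docs(sample_docs, demo_path)
restored = load_docs(demo_path)
```

If memory is still tight, writing the documents in smaller batches to separate files is another option worth trying.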