 

Word2vec training using gensim starts swapping after 100K sentences

I'm trying to train a word2vec model using a file with about 170K lines, with one sentence per line.

I think I may represent a special use case because the "sentences" have arbitrary strings rather than dictionary words. Each sentence (line) has about 100 words and each "word" has about 20 characters, with characters like "/" and also numbers.

The training code is very simple:

# as shown in http://rare-technologies.com/word2vec-tutorial/
import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

current_dir = os.path.dirname(os.path.realpath(__file__))

# each line represents a full chess match
input_dir = current_dir+"/../fen_output"
output_file = current_dir+"/../learned_vectors/output.model.bin"

sentences = MySentences(input_dir)

model = gensim.models.Word2Vec(sentences, workers=8)

The thing is, training moves quickly up to about 100K sentences (with my RAM usage steadily climbing), but then I run out of RAM, the machine starts swapping, and training grinds to a halt. I don't have a lot of RAM available, only about 4GB, and word2vec uses all of it before the swapping begins.

I think I have OpenBLAS correctly linked to numpy: this is what numpy.show_config() tells me:

blas_info:
  libraries = ['blas']
  library_dirs = ['/usr/lib']
  language = f77
lapack_info:
  libraries = ['lapack']
  library_dirs = ['/usr/lib']
  language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
  libraries = ['openblas']
  library_dirs = ['/usr/lib']
  language = f77
openblas_info:
  libraries = ['openblas']
  library_dirs = ['/usr/lib']
  language = f77
lapack_opt_info:
  libraries = ['lapack', 'blas']
  library_dirs = ['/usr/lib']
  language = f77
  define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

My question is: is this expected on a machine without much available RAM (like mine), so that I should get more RAM or train the model in smaller pieces? Or does it look like my setup isn't configured properly (or my code is inefficient)?

Thank you in advance.

asked Jun 25 '15 by Felipe

1 Answer

As a first principle, you should always get more RAM, if your budget and machine can manage it. It saves so much time & trouble.

Second, it's unclear if you mean that on a dataset of more than 100K sentences, training starts to slow down after the first 100K sentences are encountered, or if you mean that using any dataset larger than 100K sentences experiences the slowdown. I suspect it's the latter, because...

Word2Vec's memory usage is a function of the vocabulary size (the number of unique tokens) – not the total amount of data used to train. So you may want to use a larger min_count to shrink the number of tracked words and cap RAM usage during training. (Words not tracked by the model are silently dropped during training, as if they weren't there – and dropping rare words doesn't hurt much and sometimes even helps, by putting the remaining words closer to each other.)
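To get a feel for how min_count caps memory, here is a minimal pure-Python sketch. The toy corpus and the two-arrays-of-float32 cost model are illustrative assumptions, not gensim internals:

```python
from collections import Counter

# toy stand-in corpus: each "sentence" is a list of arbitrary tokens
corpus = [
    ["r1bq/pp", "2b", "3a"],
    ["r1bq/pp", "2b", "2b"],
    ["2b", "9z"],
]

counts = Counter(tok for sent in corpus for tok in sent)

def vocab_size(min_count):
    """Number of words that survive the min_count cutoff."""
    return sum(1 for c in counts.values() if c >= min_count)

def est_bytes(min_count, size=100):
    # rough cost model: two float32 arrays (input + output layers) per word
    return vocab_size(min_count) * size * 4 * 2

print(vocab_size(1), est_bytes(1))  # every word kept
print(vocab_size(2), est_bytes(2))  # rare words dropped -> less RAM
```

Raising min_count shrinks the vocabulary, and the big arrays scale linearly with it, regardless of how many total sentences you stream through.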

Finally, you may wish to avoid providing the corpus sentences in the constructor – which automatically scans and trains – and instead explicitly call the build_vocab() and train() steps yourself after model construction, so you can examine the state/size of the model and adjust your parameters as needed.

In particular, in the latest versions of gensim, you can also split the build_vocab(corpus) step into three steps: scan_vocab(corpus), scale_vocab(...), and finalize_vocab().

The scale_vocab(...) step can be called with a dry_run=True parameter that previews how large your vocabulary, subsampled corpus, and expected memory-usage will be after trying different values of the min_count and sample parameters. When you find values that seem manageable, you can call scale_vocab(...) with those chosen parameters, and without dry_run, to apply them to your model (and then finalize_vocab() to initialize the large arrays).

answered Oct 31 '22 by gojomo