I'm training some models with my own NER pipe. I need to run spaCy in an LXC container so I can run it with Python 3.6 (which allows multi-threading during training).
But of the 7 cores authorized for my container, only 1 runs at 100%; the others run at 40-60% (they actually start at 100% but drop after a few minutes). I'd really like to improve this core usage. Any idea where to look? Could it be a producer/consumer problem?
Env:
- spaCy version 2.0.8
- Location /root/.env/lib/python3.6/site-packages/spacy
- Platform Linux-3.14.32-xxxx-grs-ipv6-64-x86_64-with-debian-buster-sid
- Python version 3.6.4
The only multi-threaded operations are the matrix multiplications, which in v2.0.8 go through numpy and are delegated to a BLAS library. Everything else is single-threaded.
You should check which BLAS library your numpy is linked against, and also make sure that library has been compiled appropriately for your machine. On my machine, the numpy installed via pip comes with a copy of OpenBLAS that thinks the CPU is a Prescott, which prevents it from using AVX instructions. So the default pip numpy runs 2-3x slower than it should on my machine.
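For reference, a minimal way to inspect the BLAS linkage from Python (the exact output format varies between numpy builds):

```python
import numpy as np

# Print the build configuration, including which BLAS/LAPACK
# libraries this numpy was compiled and linked against.
np.show_config()

# The version string can also hint at where the build came from
# (pip wheel vs. conda package).
print(np.__version__)
```

If the output mentions OpenBLAS, its `OPENBLAS_CORETYPE` environment variable can be used to override the CPU architecture it auto-detects.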
Another problem is that OpenBLAS might be launching more threads than it should. This seems especially common in containers.
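As a sketch, the standard environment variables read by OpenBLAS, OpenMP and MKL can be used to cap the thread count; they have to be set before numpy is first imported:

```python
import os

# Cap the BLAS thread pools to what the container actually has;
# these must be set before the first `import numpy`.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy   # noqa: E402  (imported after the env vars on purpose)
import spacy   # noqa: E402
```

Exporting the same variables in the shell before launching Python works just as well.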
Finally, the efficiency of parallelism very much depends on batch-size. On small batches, the matrices are small and the per-update routines such as the Adam optimiser take more of the time.
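For illustration, here is a minimal v2.0-style training loop where the batch size is controlled with `spacy.util.minibatch` and `compounding`; the training data and label are placeholders:

```python
import spacy
from spacy.util import minibatch, compounding

# Placeholder data: in a real run this is your NER training set.
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup", {"entities": [(0, 5, "ORG")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("ORG")

optimizer = nlp.begin_training()
for epoch in range(10):
    losses = {}
    # Grow the batch size from 4 towards 32; larger batches mean larger
    # matrix multiplications, which parallelise better in BLAS and spend
    # relatively less time in the Adam update.
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.2, sgd=optimizer, losses=losses)
    print(epoch, losses)
```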
I usually disable multi-threading and train on a single core, because this is the most efficient (in the sense of dollars-for-work) --- I then have more models training as separate processes (usually on separate GCE VMs).
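A rough sketch of that setup, assuming a hypothetical `train.py` script that trains one model per invocation (the script name and arguments are illustrative, not a spaCy API):

```python
import os
import subprocess

# Hypothetical per-model configs; adjust to your own setup.
CONFIGS = ["model_a.json", "model_b.json", "model_c.json"]

# Force each worker to a single BLAS thread so the processes
# don't oversubscribe the cores available to the container.
env = dict(
    os.environ,
    OPENBLAS_NUM_THREADS="1",
    OMP_NUM_THREADS="1",
    MKL_NUM_THREADS="1",
)

procs = [subprocess.Popen(["python", "train.py", cfg], env=env) for cfg in CONFIGS]
for proc in procs:
    proc.wait()
```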
When writing spaCy I haven't assumed that the goal is to use lots of cores. The goal is efficiency. It's not a virtue to use your whole machine to perform the same work that could be done on a single core. A lot of papers are very misleading in this respect. For instance, it might feel satisfying to launch 12 training processes across a cloud and optimize using an asynchronous SGD strategy such as Hogwild!. This is an efficient way to burn up a bunch of energy, but it doesn't necessarily train your models any faster: using Adam and smaller batch sizes, training is more stable and often reaches the same accuracy in many fewer iterations. Similarly, we could make the network larger so the machines get their workout... but why? The goal is to train the model. Multiplying a bunch of matrices is a means, not an end.
The problem I've been most concerned with is the terrible BLAS linkage situation. This will be much improved in v2.1, as we'll be bringing our own OpenBLAS kernel. The kernel will be single-threaded by default.
A simple thing to try if you suspect your BLAS is bad is installing numpy via conda, which will give you a copy linked against Intel's MKL library.