
MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen

I am using TensorFlow version 0.12.1. The CUDA toolkit version is 8.0:

lrwxrwxrwx  1 root root   19 May 28 17:27 cuda -> /usr/local/cuda-8.0 

As documented here, I downloaded and installed cuDNN. But while executing the following line of my Python script, I get the error messages mentioned in the title:

model.fit_generator(train_generator,
                    steps_per_epoch=len(train_samples),
                    validation_data=validation_generator,
                    validation_steps=len(validation_samples),
                    epochs=9)

Detailed error message is as follows:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 82, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Any suggestion for resolving this error is appreciated.

EDIT: Issue is fatal.

uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sudo lshw -short
[sudo] password for carnd:
H/W path    Device  Class      Description
==========================================
                    system     HVM domU
/0                  bus        Motherboard
/0/0                memory     96KiB BIOS
/0/401              processor  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402              processor  CPU
/0/403              processor  CPU
/0/404              processor  CPU
/0/405              processor  CPU
/0/406              processor  CPU
/0/407              processor  CPU
/0/408              processor  CPU
/0/1000             memory     15GiB System Memory
/0/1000/0           memory     15GiB DIMM RAM
/0/100              bridge     440FX - 82441FX PMC [Natoma]
/0/100/1            bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1          storage    82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3          bridge     82371AB/EB/MB PIIX4 ACPI
/0/100/2            display    GD 5446
/0/100/3            display    GK104GL [GRID K520]
/0/100/1f           generic    Xen Platform Device
/1          eth0    network    Ethernet interface

EDIT 2:

This is an EC2 instance in the Amazon cloud, and all the numa_node files hold the value -1:

:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied

EDIT 3: After updating the numa_node files, the NUMA-related message disappeared. But all the other errors listed above remain, and I again got a fatal error:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 85, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Steephen asked May 28 '17 23:05


2 Answers

The code that prints the message "successful NUMA node read from SysFS had negative value (-1)" is shown below. It is not a fatal error: it is logged at INFO level and is effectively just a warning. The real error is the MemoryError raised from the fit_generator call in your file "model_new.py", line 85, in <module>. More of your source is needed to diagnose it. In the meantime, try making your model smaller, or run on a server with more RAM.
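Two details in the posted traceback may be related to the MemoryError: the StopIteration raised in data_generator_task suggests the generator ran out of data, and steps_per_epoch=len(train_samples) asks for one step per sample rather than one per batch. The following is only a minimal sketch of a batching pattern, not the asker's actual generator (which is not shown). The helper names batch_generator and steps_for and the batch size are assumptions made for illustration.

```python
import math
import random


def batch_generator(samples, batch_size=32, shuffle=True):
    """Yield small batches forever (hypothetical helper).

    Looping with `while True` means the generator is never exhausted,
    so the consumer never sees StopIteration. In a real Keras generator
    each batch would be loaded and converted to (X, y) NumPy arrays here,
    so only `batch_size` samples are held in memory at a time.
    """
    samples = list(samples)
    while True:
        if shuffle:
            random.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]


def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once (hypothetical helper)."""
    return math.ceil(num_samples / batch_size)
```

Under these assumptions the call could pass steps_per_epoch=steps_for(len(train_samples), 32), and lowering the batch size reduces the memory needed for each batch that is fed to the session.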


About NUMA node warning:

https://github.com/tensorflow/tensorflow/blob/e4296aefff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855

// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  }
...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                            ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }

TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the PCI bus ID of the GPU (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your machine is not multi-socket: there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so the file should contain 0 (a single NUMA node is numbered 0 in Linux). Yet the log message says that the file contained -1!
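To see what TensorFlow would read for a given device, the same lookup can be sketched in Python. This is an illustrative approximation of the C++ logic quoted above, not TensorFlow code. The function names and the sysfs_root parameter are assumptions, and returning None for an unreadable file is a choice made here (the C++ code returns kUnknownNumaNode).

```python
import os


def raw_numa_node(pci_bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the integer stored in <sysfs_root>/<pci_bus_id>/numa_node.

    Returns None if the file cannot be opened or does not hold an integer.
    """
    path = os.path.join(sysfs_root, pci_bus_id, "numa_node")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def effective_numa_node(pci_bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Apply the fallback seen in the log message: a negative value becomes 0."""
    value = raw_numa_node(pci_bus_id, sysfs_root)
    if value is None:
        return None
    return 0 if value < 0 else value
```

For the GPU in the question, raw_numa_node("0000:00:03.0") would be expected to return -1 before the workaround is applied, given the find output posted in the question.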

So we know that sysfs is mounted at /sys and that the numa_node special file exists, which requires CONFIG_NUMA to be enabled in your Linux kernel config (check with zgrep NUMA /boot/config* /proc/config*). It is in fact enabled, as CONFIG_NUMA=y, in the deb package of your x86_64 4.4.0-78-generic kernel.
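The same check can be done from Python by scanning the text of a kernel config file. This is a small sketch. The function name numa_enabled is hypothetical, and the path in the comment assumes the Ubuntu convention of storing the config under /boot.

```python
def numa_enabled(config_text):
    """Return True if the kernel config text contains CONFIG_NUMA=y."""
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_NUMA=y":
            return True
    return False


# Possible usage on the machine itself (path is an assumption):
#   import platform
#   with open("/boot/config-" + platform.release()) as f:
#       print(numa_enabled(f.read()))
```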

The special file numa_node is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your PC wrong?)

What:       /sys/bus/pci/devices/.../numa_node
Date:       Oct 2014
Contact:    Prarit Bhargava <[email protected]>
Description:
        This file contains the NUMA node to which the PCI device is
        attached, or -1 if the node is unknown.  The initial value
        comes from an ACPI _PXM method or a similar firmware
        source.  If that is missing or incorrect, this file can be
        written to override the node.  In that case, please report
        a firmware bug to the system vendor.  Writing to this file
        taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
        reduces the supportability of your system.

There is a quick (kludge) workaround for this message: find the numa_node file of your GPU and, as root, run the following command after every boot, where NNNNN is the PCI ID of your card (look in the lspci output and in the /sys/bus/pci/devices/ directory):

echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node 

Or just echo it into every such file; this should be rather safe:

for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a "$a/numa_node"; done

Also, your lshw output shows that this is not a physical PC but a Xen virtual guest. Something is wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA-support code.

osgx answered Oct 02 '22 12:10


This amends the accepted answer:

Annoyingly, the numa_node setting is reset (to the value -1) every time the system is rebooted. To fix this more persistently, you can create a crontab entry (as root).

The following steps worked for me:

# 1) Identify the PCI-ID (with domain) of your GPU
#    For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Add a crontab for root
sudo crontab -e
#    Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

This guarantees that the NUMA affinity is set to 0 for the GPU device on every reboot.
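The two manual steps can also be sketched as a small script that extracts the PCI ID from lspci -D output and prints the matching crontab line. This is only an illustration: the function names are hypothetical, the regular expression assumes the usual domain:bus:device.function address at the start of each lspci -D line, and the sample text in the usage comment is made up from the device names in the question.

```python
import re

# Address at the start of an `lspci -D` line, e.g. 0000:81:00.0
PCI_ADDR = re.compile(r"^([0-9a-fA-F]{4,}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s")


def nvidia_pci_ids(lspci_output):
    """Return the PCI addresses of all lines that mention NVIDIA."""
    ids = []
    for line in lspci_output.splitlines():
        match = PCI_ADDR.match(line)
        if match and "NVIDIA" in line:
            ids.append(match.group(1))
    return ids


def reboot_cron_line(pci_id):
    """Build the @reboot crontab line from the steps above for one device."""
    return '@reboot (echo 0 | tee -a "/sys/bus/pci/devices/{}/numa_node")'.format(pci_id)


# Possible usage (sample text is an assumption, not real output):
#   sample = "0000:00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520]"
#   for pci_id in nvidia_pci_ids(sample):
#       print(reboot_cron_line(pci_id))
```

The printed line still has to be added by hand with sudo crontab -e, as in step 2.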

Again, keep in mind that this is only a "shallow" fix as the Nvidia driver is unaware of it:

nvidia-smi topo -m
#       GPU0  CPU Affinity  NUMA Affinity
# GPU0     X  0-127         N/A

normanius answered Oct 02 '22 14:10