
Failed to attach to any of the Graphcore IPU devices when running simple TensorFlow code example

Tags: tensorflow, ipu

I tried running one of Graphcore's GitHub code examples, the TensorFlow simple replication one. I followed the README and passed --replication-factor 16, and the following error was thrown:

tensorflow.python.framework.errors_impl.InternalError: Failed to attach to any of the device(s) with matching configs for ordinal 0 

I'm not sure why it's failing to attach. I've used gc-info -l as a debugging tool, and it correctly shows all the IPU configurations available on the chassis. The example was working fine before, and the failure seems to be intermittent: I've tried rebooting, but the error reappears at random after a while. Any help would be much appreciated.

Odysseo asked May 12 '20 14:05




1 Answer

This failure might be caused by the IPUs being busy running other processes or by an incorrect environment configuration.

1. The IPUs are busy

When you execute a Poplar program (or a framework-specific model that uses the IPU libraries), you request a certain number of IPUs. If, for instance, you request 2 IPUs but other processes are already using all the IPUs on the chassis, your program will fail to attach and throw an error similar to the one you've seen. In this scenario, simply wait until the required number of IPUs is available. You can check whether the devices are busy with the gc-monitor command line tool (see the IPU Command Line Tools guide for reference). This is what a busy machine looks like:

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-----------------+
|                                                                                                        Attached processes                                                                                                         |          IPU           |      Board      |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+------------+----+----------+--------+--------+--------+
|  PID   |                                                                                              Command                                                                                               |  Time  |    User    | ID |  Clock   |  Temp  |  Temp  | Power  |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+------------+----+----------+--------+--------+--------+
| 32778  |                                                                                               python                                                                                               | 7m34s  |  User_Name  | 0  | 1300MHz  | 37.1 C | 41.5 C |104.7 W |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+------------+----+----------+--------+--------+--------+

This is what an idle machine looks like:

+--------------------------------------------------------------------------------------------------+
|                                      No attached processes                                       |
+--------------------------------------------------------------------------------------------------+
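The "wait until the IPUs are free" advice can be automated. The following is only a minimal sketch: it assumes gc-monitor is on your PATH, that it prints its table once and exits when run without arguments, and that the idle banner text matches the output shown above. The function names are illustrative and not part of any Graphcore API.

```python
import subprocess
import time

# Banner taken from the idle gc-monitor output shown above (assumed stable).
IDLE_MARKER = "No attached processes"


def is_idle(monitor_output: str) -> bool:
    """Return True if the given gc-monitor output reports no attached processes."""
    return IDLE_MARKER in monitor_output


def wait_until_idle(poll_seconds: float = 30.0, max_polls: int = 20) -> bool:
    """Poll gc-monitor until nothing is attached; give up after max_polls tries.

    Raises FileNotFoundError if gc-monitor is not on PATH.
    """
    for _ in range(max_polls):
        result = subprocess.run(["gc-monitor"], capture_output=True, text=True)
        if result.returncode == 0 and is_idle(result.stdout):
            return True
        time.sleep(poll_seconds)
    return False
```

Note that this check is conservative: it only returns True when no process is attached at all, even though your job might fit alongside others if enough IPUs happen to be free.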

2. gc-driver is not activated

You can check whether the driver has been activated by running gc-info -l. If it has not, the shell cannot find the command and reports:

gc-info: command not found

Alternatively, if you are running, for example, a TensorFlow application, you might hit the following error (or a similar one):

tensorflow.python.framework.errors_impl.InvalidArgumentError: Target configuration failed: model disabled and no hardware IPU found. (Are you sure you enabled the Poplar driver?) 

If the driver is activated, on the other hand, gc-info -l typically lists all the IPUs available on your hardware platform.
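If you want to run this check from a script before launching a job, something along these lines may help. It is a rough sketch that only tests whether gc-info is visible on PATH (the "command not found" case) and then returns the raw gc-info -l output; the helper names are hypothetical.

```python
import shutil
import subprocess


def driver_tools_on_path(tool: str = "gc-info") -> bool:
    """Return True if the given command line tool can be found on PATH."""
    return shutil.which(tool) is not None


def list_ipus() -> str:
    """Return the raw output of `gc-info -l`, or raise if the tool is missing."""
    if not driver_tools_on_path():
        # Same situation as the shell's "gc-info: command not found".
        raise RuntimeError("gc-info not found on PATH; source the driver's enable.sh first")
    result = subprocess.run(
        ["gc-info", "-l"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling driver_tools_on_path() first gives a clearer message than letting the TensorFlow program fail later with the "no hardware IPU found" error.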

To activate gc-driver, source its enable script as follows:

source <path_to_sdk>/gc_drivers-ubuntu_<ubuntu_version>-<sdk_version>+<hash>/enable.sh

In your case, gc-info -l is working fine, so you are most likely hitting case 1.

3. gc-driver is not installed

To check if gc-driver is installed correctly you can run:

$ modinfo ipu_driver 

This should print something similar to the following on your console:

filename:       /lib/modules/4.15.0-58-generic/updates/dkms/ipu_driver.ko 
version:        1.0.41 
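The same check can be scripted. This is a minimal sketch assuming the usual `key: value` layout of modinfo output shown above; the function names are illustrative, and a None result is only a hint that the module is not installed for the running kernel.

```python
import subprocess
from typing import Dict, Optional


def parse_modinfo(text: str) -> Dict[str, str]:
    """Parse `key: value` lines from modinfo output, keeping the first value per key."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def ipu_driver_info() -> Optional[Dict[str, str]]:
    """Return parsed `modinfo ipu_driver` fields, or None if modinfo reports an error.

    Raises FileNotFoundError if modinfo itself is not on PATH.
    """
    result = subprocess.run(
        ["modinfo", "ipu_driver"], capture_output=True, text=True
    )
    if result.returncode != 0:
        return None
    return parse_modinfo(result.stdout)
```

For example, ipu_driver_info() returning None would point to case 3, while a dictionary with a "version" entry suggests the kernel module is installed.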

Mimi Lm answered Oct 17 '22 15:10