I have tried repeatedly to create a debug build of a recent TensorFlow version using the official Docker images (latest-devel-gpu-py3 -> r1.12.0), but nothing seems to work. Has anyone recently produced a successful debug build of TensorFlow (>= r1.11.0) who can share their approach?
This is what I tried so far.
I basically tried to follow the instructions at https://www.tensorflow.org/install/source, but tried to modify them to generate a debug build. Nothing I tried resulted in a successful build.
The host system is an x86-64 Linux machine with plenty of RAM (512 GB, a DGX-1). The CUDA version inside the Docker image is 9.0. The TensorFlow version inside the "latest" Docker image is r1.12.0.
In order to get any cuda-build working, I needed to use "nvidia-docker", otherwise I get a linker error with "libcuda.so.1".
I started like this:
nvidia-docker pull tensorflow/tensorflow:latest-devel-gpu-py3
nvidia-docker run --runtime=nvidia -it -w /tensorflow -v $PWD:/mnt -e HOST_PERMS="$(id -u):$(id -g)" \
tensorflow/tensorflow:latest-devel-gpu-py3 bash
Then I tried to configure the project using
cd /tensorflow
./configure
I tried various configurations: keeping all values at their defaults, enabling only the parts I need, not running ./configure at all, and pointing it to my own CUDA 9.0 and TensorRT installation. Not running ./configure at all (in the Docker image) seems to produce the best results, i.e. I can do optimized builds successfully with the least effort.
If I build it using the exact official build instructions, i.e. creating an optimized/non-debug build, everything works as expected. So running the following seems to succeed.
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
The same is true of the following, which includes debug info but does not turn off optimization (so I cannot really use it for debugging):
bazel build --config cuda --strip=never -c opt --copt="-ggdb" //tensorflow/tools/pip_package:build_pip_package
But nothing that disables optimizations seems to work. If I run the following (with or without the --strip=never flag):
bazel build --config cuda --strip=never -c dbg //tensorflow/tools/pip_package:build_pip_package
I arrive at the following error:
INFO: From Compiling tensorflow/contrib/framework/kernels/zero_initializer_op_gpu.cu.cc:
external/com_google_absl/absl/strings/string_view.h(496): error: constexpr function return is non-constant
This can be resolved by defining -DNDEBUG (see "nvcc error: string_view.h: constexpr function return is non-constant").
But if I run the following:
bazel build --config cuda --strip=never -c dbg --copt="-DNDEBUG" //tensorflow/tools/pip_package:build_pip_package
I get these linking errors at the final step of the build:
ERROR: /tensorflow/python/BUILD:3865:1: Linking of rule '//tensorflow/python:_pywrap_tensorflow_internal.so' failed (Exit 1)
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crti.o: In function `_init':
(.init+0x7): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
/usr/lib/gcc/x86_64-linux-gnu/5/crtbeginS.o: In function `deregister_tm_clones':
crtstuff.c:(.text+0x3): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0xa): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .nvFatBinSegment section in bazel-out/k8-dbg/bin/tensorflow/python/_pywrap_tensorflow_internal.so
crtstuff.c:(.text+0x1e): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `_ITM_deregisterTMCloneTable'
/usr/lib/gcc/x86_64-linux-gnu/5/crtbeginS.o: In function `register_tm_clones':
crtstuff.c:(.text+0x43): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0x4a): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .nvFatBinSegment section in bazel-out/k8-dbg/bin/tensorflow/python/_pywrap_tensorflow_internal.so
crtstuff.c:(.text+0x6b): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `_ITM_registerTMCloneTable'
/usr/lib/gcc/x86_64-linux-gnu/5/crtbeginS.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0x92): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x9c): relocation truncated to fit: R_X86_64_GOTPCREL against symbol `__cxa_finalize@@GLIBC_2.2.5' defined in .text section in /lib/x86_64-linux-gnu/libc.so.6
crtstuff.c:(.text+0xaa): relocation truncated to fit: R_X86_64_PC32 against symbol `__dso_handle' defined in .data.rel.local section in /usr/lib/gcc/x86_64-linux-gnu/5/crtbeginS.o
crtstuff.c:(.text+0xbb): additional relocation overflows omitted from the output
bazel-out/k8-dbg/bin/tensorflow/python/_pywrap_tensorflow_internal.so: PC-relative offset overflow in GOT PLT entry for `_ZNK5Eigen10TensorBaseINS_9TensorMapINS_6TensorIKjLi1ELi1EiEELi16ENS_11MakePointerEEELi0EE9unaryExprINS_8internal11scalar_leftIjjN10tensorflow7functor14right_shift_opIjEEEEEEKNS_18TensorCwiseUnaryOpIT_KS6_EERKSH_'
collect2: error: ld returned 1 exit status
Target //tensorflow/tools/pip_package:build_pip_package failed to build
I hoped to be able to solve that by doing a monolithic build. So I tried that, and got essentially the same error.
bazel build --config cuda -c dbg --config=monolithic --copt="-DNDEBUG" //tensorflow/tools/pip_package:build_pip_package
I also tried the approaches from "TensorFlow doesnt build with debug mode" and several other variants I found through extensive googling. I'm running out of options.
I'd take any TensorFlow version from 1.11 onwards, including (working) nightly builds. It just needs to work with CUDA 9 on x86 Linux, include debug symbols, and have optimizations disabled.
Thank you very much in advance.
Just in case someone else stumbles over this problem: I finally got it to compile using the following command:
bazel build --config cuda --strip=never --copt="-DNDEBUG" --copt="-march=native" --copt="-Og" --copt="-g3" --copt="-mcmodel=medium" --copt="-fPIC" //tensorflow/tools/pip_package:build_pip_package
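For readability, the same command can be assembled flag by flag. This is only a sketch: the function name print_debug_build_cmd is made up for illustration, the flag comments give the usual GCC meaning of each option rather than anything confirmed for this particular build, and the function prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: prints the build command from the answer above.
# print_debug_build_cmd is a hypothetical helper name.
#
# Usual GCC meaning of the flags (assumed, not verified against this build):
#   -DNDEBUG         disables assert(); works around the string_view.h constexpr error
#   -march=native    generates code for the CPU of the build machine
#   -Og              enables only optimizations that do not hinder debugging
#   -g3              emits maximum debug info, including macro definitions
#   -mcmodel=medium  medium code model; large data may sit beyond the 2 GiB range,
#                    which presumably avoids "relocation truncated to fit"
#   -fPIC            position-independent code for the shared library
print_debug_build_cmd() {
  set -- bazel build --config cuda --strip=never
  for opt in -DNDEBUG -march=native -Og -g3 -mcmodel=medium -fPIC; do
    set -- "$@" "--copt=${opt}"
  done
  set -- "$@" //tensorflow/tools/pip_package:build_pip_package
  printf '%s\n' "$*"
}

# Dry run: show the command; paste it into the container shell to run it.
print_debug_build_cmd
```

Note that -c dbg is absent here, exactly as in the command above; the debug settings come entirely from the --copt flags.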
After that, installation is a bit of a hassle, since the wheel can no longer be built. But the TensorFlow build can be installed anyway.
When building the wheel via
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
the process fails with an error that seems to come from Python's built-in zip compression library (i.e. it cannot compress the resulting archive because it is too large).
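If the archiving failure really is size related, it can help to see which files in the staging directory are unusually large. A minimal sketch: list_large_files is a hypothetical helper, and the default threshold of 2 GiB minus one byte is an assumption based on the classic non-ZIP64 size limit, not something taken from the actual error message.

```shell
#!/bin/sh
# Sketch only: list regular files under DIR that are larger than LIMIT_BYTES.
# list_large_files is a hypothetical helper; the default limit is an assumption.
#
# Usage: list_large_files DIR [LIMIT_BYTES]
list_large_files() {
  dir=$1
  limit=${2:-2147483647}
  # "-size +Nc" matches files whose size exceeds N bytes.
  find "$dir" -type f -size +"${limit}"c
}

# Example (the path is the placeholder temp directory name used in this answer):
# list_large_files /tmp/Shjwejweu
```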
It is important to run build_pip_package anyway, since it only fails at the final step (archiving). Right at the start, it prints to the console that it is building the package in a temporary directory (say, /tmp/Shjwejweu). The contents of that temporary directory can be used to install the debug version of TensorFlow. Simply copy the directory to the target machine, make sure any old tensorflow package is removed (e.g. pip uninstall tensorflow), and run the following inside it:
python setup.py install
But be careful to uninstall the existing "tensorflow" package first; otherwise you can end up with two TensorFlow versions installed at the same time.
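The uninstall-then-install steps can be wrapped in a small helper so the order is not forgotten. This is a sketch under assumptions: install_debug_tf is a made-up name, and the PIP and PYTHON environment variables are hypothetical overrides added here so the tool names can be changed (for example to pip3 and python3).

```shell
#!/bin/sh
# Sketch only: install TensorFlow from the staging directory left behind
# by build_pip_package. install_debug_tf, PIP and PYTHON are hypothetical names.
#
# Usage: install_debug_tf STAGING_DIR
install_debug_tf() {
  staging_dir=$1
  pip_cmd=${PIP:-pip}
  python_cmd=${PYTHON:-python}

  # Refuse to continue if this does not look like the staging directory.
  if [ ! -f "${staging_dir}/setup.py" ]; then
    echo "no setup.py found in ${staging_dir}" >&2
    return 1
  fi

  # Remove any previously installed package first, to avoid ending up
  # with two TensorFlow versions installed side by side.
  "$pip_cmd" uninstall -y tensorflow || true

  # Run the install from inside the staging directory (in a subshell,
  # so the caller's working directory is left unchanged).
  (cd "$staging_dir" && "$python_cmd" setup.py install)
}

# Example (the path is the placeholder temp directory name used in this answer):
# install_debug_tf /tmp/Shjwejweu
```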