
Segmentation fault in __pthread_getspecific called from libcuda.so.1

Problem: Segmentation fault (SIGSEGV, signal 11)

Brief program description:

  • high-performance GPU (CUDA) server handling requests from remote clients
  • each incoming request spawns a thread that performs calculations on multiple GPUs (serially, not in parallel) and sends a result back to the client; this usually takes anywhere between 10-200 ms, as each request consists of tens or hundreds of kernel calls
  • request handler threads have exclusive access to the GPUs, meaning that if one thread is running something on GPU1, all others have to wait until it is done (a rough sketch of this flow follows the list)
  • compiled with -arch=sm_35 -code=compute_35
  • using CUDA 5.0
  • I'm not using any CUDA atomics explicitly or any in-kernel synchronization barriers, though I am using Thrust (various functions) and, obviously, cudaDeviceSynchronize()
  • Nvidia driver: NVIDIA dlloader X Driver 313.30 Wed Mar 27 15:33:21 PDT 2013
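
To make the setup above concrete, here is a simplified sketch of the request flow (the names and the per-GPU mutex are just illustrative of the "exclusive access" described above, not the actual code; error checking omitted):

// simplified illustration only -- a mutex per GPU is just one way to get the
// "exclusive access" described above; the real code differs in the details
#include <cuda_runtime.h>
#include <mutex>
#include <thread>

static const int kNumGpus = 4;
static std::mutex g_gpu_lock[kNumGpus];                   // one lock per card

void run_on_gpu(int dev /*, request data ... */)
{
    std::lock_guard<std::mutex> guard(g_gpu_lock[dev]);   // other threads wait here
    cudaSetDevice(dev);                    // bind this thread to the card
    // ... tens to hundreds of kernel / Thrust calls, all on the default stream ...
    cudaDeviceSynchronize();               // wait until the card is done
}

void handle_request()
{
    // the cards are used one after another (serially), never in parallel
    for (int dev = 0; dev < kNumGpus; ++dev)
        run_on_gpu(dev);
}

// each incoming connection gets its own thread, roughly:
//   std::thread(handle_request).detach();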

OS and HW info:

  • Linux lub1 3.5.0-23-generic #35~precise1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
  • GPUs: 4x GeForce GTX TITAN
  • 32 GB RAM
  • MB: ASUS MAXIMUS V EXTREME
  • CPU: i7-3770K

Crash information:

Crash occurs "randomly" after a couple of thousand requests are handled (sometimes sooner, sometimes later). Stack traces from some of the crashes look like this:

#0  0x00007f8a5b18fd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1  0x00007f8a5a0c0cf3 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f8a59ff7b30 in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f8a59fcc34a in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f8a5ab253e7 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5  0x00007f8a5ab484fa in cudaGetDevice () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6  0x000000000046c2a6 in thrust::detail::backend::cuda::arch::device_properties() ()


#0  0x00007ff03ba35d91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1  0x00007ff03a966cf3 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007ff03aa24f8b in ?? () from /usr/lib/libcuda.so.1
#3  0x00007ff03b3e411c in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#4  0x00007ff03b3dd4b3 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5  0x00007ff03b3d18e0 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6  0x00007ff03b3fc4d9 in cudaMemset () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7  0x0000000000448177 in libgbase::cudaGenericDatabase::cudaCountIndividual(unsigned int, ...


#0  0x00007f01db6d6153 in ?? () from /usr/lib/libcuda.so.1
#1  0x00007f01db6db7e4 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f01db6dbc30 in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f01db6dbec2 in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f01db6c6c58 in ?? () from /usr/lib/libcuda.so.1
#5  0x00007f01db6c7b49 in ?? () from /usr/lib/libcuda.so.1
#6  0x00007f01db6bdc22 in ?? () from /usr/lib/libcuda.so.1
#7  0x00007f01db5f0df7 in ?? () from /usr/lib/libcuda.so.1
#8  0x00007f01db5f4e0d in ?? () from /usr/lib/libcuda.so.1
#9  0x00007f01db5dbcea in ?? () from /usr/lib/libcuda.so.1
#10 0x00007f01dc11e0aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#11 0x00007f01dc1466dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#12 0x0000000000472373 in thrust::detail::backend::cuda::detail::b40c_thrust::BaseRadixSortingEnactor


#0  0x00007f397533dd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1  0x00007f397426ecf3 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f397427baec in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f39741a9840 in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f39741add08 in ?? () from /usr/lib/libcuda.so.1
#5  0x00007f3974194cea in ?? () from /usr/lib/libcuda.so.1
#6  0x00007f3974cd70aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7  0x00007f3974cff6dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#8  0x000000000046bf26 in thrust::detail::backend::cuda::detail::checked_cudaMemcpy(void*

As you can see, it usually ends up in __pthread_getspecific called from libcuda.so, or somewhere inside the library itself. As far as I remember, there has been just one case where it did not crash but instead hung in a strange way: the program was still able to respond to my requests if they did not involve any GPU computation (statistics etc.), but otherwise I never got a reply. Also, nvidia-smi -L did not work; it just hung until I rebooted the computer. It looked to me like some sort of GPU deadlock. This might be a completely different issue than this one, though.

Does anyone have a clue where the problem might be or what could cause this?

Updates:

Some additional analysis:

  • cuda-memcheck does not print any error messages.
  • valgrind leak check does print quite a few messages, like those below (there are hundreds like them):
==2464== 16 bytes in 1 blocks are definitely lost in loss record 6 of 725
==2464==    at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x56B859D: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5050C82: __nptl_deallocate_tsd (pthread_create.c:156)
==2464==    by 0x5050EA7: start_thread (pthread_create.c:315)
==2464==    by 0x6DDBCBC: clone (clone.S:112)
==2464==
==2464== 16 bytes in 1 blocks are definitely lost in loss record 7 of 725
==2464==    at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x56B86D8: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5677E0F: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x400F90D: _dl_fini (dl-fini.c:254)
==2464==    by 0x6D23900: __run_exit_handlers (exit.c:78)
==2464==    by 0x6D23984: exit (exit.c:100)
==2464==    by 0x6D09773: (below main) (libc-start.c:258)

==2464== 408 bytes in 3 blocks are possibly lost in loss record 222 of 725
==2464==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x5A89B98: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A8A1F2: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A8A3FF: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5B02E34: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5AFFAA5: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5AAF009: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A7A6D3: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x59B205C: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5984544: cuInit (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x568983B: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5689967: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)

More information:

I have tried running on fewer cards (3, as that is the minimum needed for the program) and the crash still occurs.

The above is not true: I had misconfigured the application and it was still using all four cards. Re-running the experiments with really just 3 cards seems to resolve the problem; it has now been running for several hours under heavy load without crashes. I will let it run a bit longer and then try a different subset of 3 cards, to verify this and at the same time test whether the problem is related to one particular card or not.

I monitored GPU temperature during the test runs and there does not seem to be anything wrong. The cards get up to about 78-80 °C under the highest load, with the fans at about 56%, and stay there until the crash happens (several minutes); that does not seem too high to me.

One thing I have been thinking about is the way the requests are handled: there are quite a lot of cudaSetDevice calls, since each request spawns a new thread (I'm using the mongoose library) and that thread then switches between cards by calling cudaSetDevice(id) with the appropriate device id. The switching can happen multiple times during one request, and I am not using any streams (so everything goes to the default (0) stream, IIRC); a rough sketch of this is below. Can this somehow be related to the crashes occurring in pthread_getspecific?
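
To illustrate what I mean (function names made up, error checks omitted), the switching inside a single request-handling thread looks roughly like this:

// cudaSetDevice() only changes the current device of the calling thread,
// so the request thread hops between cards like this, always on stream 0
#include <cuda_runtime.h>

void do_part_on(int device_id /*, data ... */)
{
    cudaSetDevice(device_id);          // switch this thread to another card
    // kernel<<<grid, block>>>(...);   // or Thrust calls, default (0) stream
    cudaDeviceSynchronize();
}

void handle_one_request()
{
    // the same thread may switch devices several times during one request
    do_part_on(0);
    do_part_on(2);
    do_part_on(1);
}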

I have also tried upgrading to the latest drivers (beta, 319.12) but that didn't help.

asked Apr 23 '13 by PeterK


1 Answer

If you can identify 3 cards that work, try cycling the 4th card in place of one of the 3, and see if you get the failures again. This is just standard troubleshooting I think. If you can identify a single card that, when included in a group of 3, still elicits the issue, then that card is suspect.

But, my suggestion to run with fewer cards was also based on the idea that it may reduce the overall load on the PSU. Even at 1500W, you may not have enough juice. So if you cycle the 4th card in, in place of one of the 3 (i.e. still keep only 3 cards in the system or configure your app to use 3) and you get no failures, the problem may be due to overall power draw with 4 cards.

Note that the power consumption of the GTX Titan at full load can be on the order of 250W or possibly more. So it might seem that your 1500W PSU should be fine, but it may come down to a careful analysis of how much DC power is available on each rail, and how the motherboard and PSU harness distribute the 12V DC rails to each GPU.

So if reducing to 3 GPUs seems to fix the problem no matter which 3 you use, my guess is that your PSU is not up to the task. Not all 1500W is available from a single DC rail. The 12V "rail" is actually composed of several different 12V rails, each of which delivers a certain portion of the overall 1500W. So even though you may not be pulling 1500W, you can still overload a single rail, depending on how the GPU power is connected to the rails.
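
To put rough numbers on it (the 30 A per-rail rating below is just a hypothetical example, not your PSU's actual specification):

\[
4 \times 250\,\mathrm{W} \approx 1000\,\mathrm{W}\ \text{(GPUs alone)},
\qquad
12\,\mathrm{V} \times 30\,\mathrm{A} = 360\,\mathrm{W}\ \text{(available from one hypothetical 30 A rail)}
\]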

I agree that temperatures in the 80 °C range should be fine, but that indicates (approximately) a fully loaded GPU, so if you're seeing that on all 4 GPUs at once, then you are pulling a heavy load.

answered by Robert Crovella