I am getting a data race when calling pthread_create() recursively. I don't know if the recursion causes the problem, but the race seems to never occur on the first iteration, mostly on the second and rarely on the third.
When using libgc, there are memory corruption symptoms, such as segmentation faults, that coincide with the data race.
The following program is a minimal example that illustrates the problem. I'm not using libgc in the example as only the data race is the topic of this question.
The data race is visible when running Valgrind with the Helgrind tool. There are slight variations on the problems reported, including sometimes no problem at all.
I'm running Linux Mint 17.2. The version of gcc is (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4.
The following example, 'main.c', reproduces the problem. It iterates over a linked list, printing each element's value in a separate thread:
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
typedef struct List {
int head ;
struct List* tail ;
} List ;
// create a list element with an integer head and a tail
List* new_list( int head, List* tail ) {
List* l = (List*)malloc( sizeof( List ) ) ;
l->head = head ;
l->tail = tail ;
return l ;
}
// create a thread and start it
void call( void* (*start_routine)( void* arg ), void* arg ) {
pthread_t* thread = (pthread_t*)malloc( sizeof( pthread_t ) ) ;
if ( pthread_create( thread, NULL, start_routine, arg ) ) {
exit( -1 ) ;
}
pthread_detach( *thread ) ;
return ;
}
void print_list( List* l ) ;
// start routine for thread
void* print_list_start_routine( void* arg ) {
// verify that the list is not empty ( = NULL )
// print its head
// print the rest of it in a new thread
if ( arg ) {
List* l = (List*)arg ;
printf( "%d\n", l->head ) ;
print_list( l->tail ) ;
}
return NULL ;
}
// print elements of a list with one thread for each element printed
// threads are created recursively
void print_list( List* l ) {
call( print_list_start_routine, (void*)l ) ;
}
int main( int argc, const char* argv[] ) {
List* l = new_list( 1, new_list( 2, new_list( 3, NULL ) ) ) ;
print_list( l ) ;
// wait for all threads to finish
pthread_exit( NULL ) ;
return 0 ;
}
Here is 'makefile':
CC=gcc
a.out: main.o
$(CC) -pthread main.o
main.o: main.c
$(CC) -c -g -O0 -std=gnu99 -Wall main.c
clean:
rm *.o a.out
Here is the most common output of Helgrind. Note that the lines containing only a single digit (1, 2 and 3) are output of the program, not of Helgrind:
$ valgrind --tool=helgrind ./a.out
==13438== Helgrind, a thread error detector
==13438== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==13438== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==13438== Command: ./a.out
==13438==
1
2
==13438== ---Thread-Announcement------------------------------------------
==13438==
==13438== Thread #3 was created
==13438== at 0x515543E: clone (clone.S:74)
==13438== by 0x4E44199: do_clone.constprop.3 (createthread.c:75)
==13438== by 0x4E458BA: pthread_create@@GLIBC_2.2.5 (createthread.c:245)
==13438== by 0x4C30C90: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==13438== by 0x4007EB: call (main.c:25)
==13438== by 0x400871: print_list (main.c:58)
==13438== by 0x40084D: print_list_start_routine (main.c:48)
==13438== by 0x4C30E26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==13438== by 0x4E45181: start_thread (pthread_create.c:312)
==13438== by 0x515547C: clone (clone.S:111)
==13438==
==13438== ---Thread-Announcement------------------------------------------
==13438==
==13438== Thread #2 was created
==13438== at 0x515543E: clone (clone.S:74)
==13438== by 0x4E44199: do_clone.constprop.3 (createthread.c:75)
==13438== by 0x4E458BA: pthread_create@@GLIBC_2.2.5 (createthread.c:245)
==13438== by 0x4C30C90: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==13438== by 0x4007EB: call (main.c:25)
==13438== by 0x400871: print_list (main.c:58)
==13438== by 0x4008BB: main (main.c:66)
==13438==
==13438== ----------------------------------------------------------------
==13438==
==13438== Possible data race during write of size 1 at 0x602065F by thread #3
==13438== Locks held: none
==13438== at 0x4C368F5: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==13438== by 0x4012CD6: _dl_allocate_tls_init (dl-tls.c:436)
==13438== by 0x4E45715: pthread_create@@GLIBC_2.2.5 (allocatestack.c:252)
==13438== by 0x4C30C90: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==13438== by 0x4007EB: call (main.c:25)
==13438== by 0x400871: print_list (main.c:58)
==13438== by 0x40084D: print_list_start_routine (main.c:48)
==13438== by 0x4C30E26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==13438== by 0x4E45181: start_thread (pthread_create.c:312)
==13438== by 0x515547C: clone (clone.S:111)
==13438==
==13438== This conflicts with a previous read of size 1 by thread #2
==13438== Locks held: none
==13438== at 0x51C10B1: res_thread_freeres (in /lib/x86_64-linux-gnu/libc-2.19.so)
==13438== by 0x51C1061: __libc_thread_freeres (in /lib/x86_64-linux-gnu/libc-2.19.so)
==13438== by 0x4E45199: start_thread (pthread_create.c:329)
==13438== by 0x515547C: clone (clone.S:111)
==13438==
3
==13438==
==13438== For counts of detected and suppressed errors, rerun with: -v
==13438== Use --history-level=approx or =none to gain increased speed, at
==13438== the cost of reduced accuracy of conflicting-access information
==13438== ERROR SUMMARY: 8 errors from 1 contexts (suppressed: 56 from 48)
As mentioned by Pooja Nilangekar, replacing pthread_detach() with pthread_join() removes the race. However, detaching the threads is a requirement so the goal is to cleanly detach the threads. In other words, keep the pthread_detach() while removing the race.
There seems to be some unintended sharing between the threads. The unintended sharing may be related to what is discussed here: http://www.domaigne.com/blog/computing/joinable-and-detached-threads/ Especially the bug in the example.
I still don't understand what is really going on.
The output of helgrind does not match your source. According to helgrind, there is a pthread_create call in line 25, but all I see there is exit(-1). I assume you forgot to add a line at the beginning of the source.
That being said, I cannot reproduce the output of helgrind at all. I have run your program in a while loop, hoping to get the same error, but nada. That's the nasty thing about races - you never know when they occur, and they are hard to track.
Then there's another thing: res_thread_freeres is called whenever resolver state information (DNS) is about to be freed. In fact, it appears to be called unconditionally when a thread exits, without any check. And _dl_allocate_tls_init sets up Thread Local Storage (TLS) and ensures that certain resources and metadata (custom stack, cleanup information, etc.) are allocated/stored before control is handed to your function in the new thread.
That suggests that there is a race between creating a new thread and killing an old one. Since you detach your threads, it's possible that a parent thread dies before the child finishes. In that case, syncing the exiting of the threads (Pooja Nilangekar pointed out that this can be done by joining them) might solve the issue, as pthread_join stalls until a thread finishes, thus syncing child/parent deallocation.
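To make the join-based synchronization concrete, here is a minimal sketch, assuming each thread is allowed to wait for the child it creates. The names print_list_joined and print_list_joined_routine are hypothetical and not part of the original program. It gives up the detach requirement, so it only illustrates the ordering that pthread_join enforces; I have not run it under Helgrind.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct List {
    int head;
    struct List *tail;
} List;

/* Same constructor as in the question. */
static List *new_list(int head, List *tail) {
    List *l = malloc(sizeof *l);
    assert(l != NULL);
    l->head = head;
    l->tail = tail;
    return l;
}

/* Thread body (hypothetical name): print the head, start a child for
   the tail, then wait for that child. A parent therefore never exits
   while its child is still being created or is still running. */
static void *print_list_joined_routine(void *arg) {
    List *l = arg;
    if (l == NULL)
        return NULL;
    printf("%d\n", l->head);
    pthread_t child;
    if (pthread_create(&child, NULL, print_list_joined_routine, l->tail) != 0)
        exit(EXIT_FAILURE);
    pthread_join(child, NULL);
    return NULL;
}

/* Hypothetical entry point: returns 0 once the whole chain has finished,
   or the pthread error code otherwise. */
static int print_list_joined(List *l) {
    pthread_t first;
    int rc = pthread_create(&first, NULL, print_list_joined_routine, l);
    if (rc != 0)
        return rc;
    return pthread_join(first, NULL);
}
```

Calling print_list_joined( new_list( 1, new_list( 2, new_list( 3, NULL ) ) ) ) from main prints the elements in order and returns only after the last thread has exited, so the pthread_exit( NULL ) in main is no longer needed.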
If you still want to keep the threads detached, you could manage the stack memory yourself. See pthread_attr_setstack specifically. Since I cannot reproduce the error, I haven't verified that this really works.
Also, this approach requires you to know the number of threads you are going to have. If you try to reallocate memory that is currently used by threads, you are playing with fire.
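A rough sketch of the pthread_attr_setstack idea, under stated assumptions: MAX_THREADS, STACK_SIZE, init_stack_pool, call_on_own_stack and the finished counter are all hypothetical names and values I made up for illustration. The stacks are allocated once and never reused or freed, which sidesteps the reallocation problem mentioned above. The threads are created detached through the attribute object instead of calling pthread_detach afterwards. Whether this actually silences the Helgrind report is untested.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_THREADS 8              /* assumed upper bound, known in advance */
#define STACK_SIZE  (1024 * 1024)  /* assumed per-thread stack size (1 MiB) */

typedef struct List {
    int head;
    struct List *tail;
} List;

static char *stack_pool;      /* MAX_THREADS stacks, allocated once, never freed */
static atomic_int next_slot;  /* index of the next unused stack */
static atomic_int finished;   /* threads that reached the end of their routine */

/* Allocate one page-aligned block that holds every thread stack. */
static int init_stack_pool(void) {
    void *p = NULL;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    if (posix_memalign(&p, (size_t)page, (size_t)MAX_THREADS * STACK_SIZE) != 0)
        return -1;
    stack_pool = p;
    return 0;
}

/* Start a detached thread on its own slot of the pool.
   Each slot is handed out exactly once. Returns 0 on success, -1 on failure. */
static int call_on_own_stack(void *(*fn)(void *), void *arg) {
    int slot = atomic_fetch_add(&next_slot, 1);
    if (slot >= MAX_THREADS)
        return -1;
    pthread_attr_t attr;
    pthread_t thread;
    if (pthread_attr_init(&attr) != 0)
        return -1;
    int rc = pthread_attr_setstack(&attr, stack_pool + (size_t)slot * STACK_SIZE,
                                   STACK_SIZE);
    if (rc == 0)
        rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&thread, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc == 0 ? 0 : -1;
}

/* Same recursion as in the question: print the head, then hand the
   tail to a new detached thread. */
static void *print_list_routine(void *arg) {
    List *l = arg;
    if (l != NULL) {
        printf("%d\n", l->head);
        if (call_on_own_stack(print_list_routine, l->tail) != 0)
            exit(EXIT_FAILURE);
    }
    atomic_fetch_add(&finished, 1);
    return NULL;
}
```

To use it, call init_stack_pool() once in main, start the chain with call_on_own_stack( print_list_routine, l ), and end main with pthread_exit( NULL ) as before. A list of n elements needs n + 1 slots, because the last thread is started with a NULL tail.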