Registering a level triggered eventfd on epoll_ctl
only fires once, when not decrementing the eventfd counter. To summarize the problem, I have observed that the epoll flags (EPOLLET
, EPOLLONESHOT
or None
for level triggered behaviour) behave similar. Or in other words: Does not have an effect.
Could you confirm this bug?
I have an application with multiple threads. Each thread waits for new events with epoll_wait
with the same epollfd. If you want to terminate the application gracefully, all threads have to be woken up. My thought was that you use the eventfd counter (EFD_SEMAPHORE|EFD_NONBLOCK
) for this (with level triggered epoll behavior) to wake up all together. (Regardless of the thundering herd problem for a small number of filedescriptors.)
E.g. for 4 threads you write 4 to the eventfd. I was expecting epoll_wait
returns immediately and again and again until the counter is decremented (read) 4 times. epoll_wait
only returns once for every write.
Yep, I read all related manuals carefully ;)
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>
static int event_fd = -1;
static int epoll_fd = -1;
void *thread(void *arg)
{
(void) arg;
for(;;) {
struct epoll_event event;
epoll_wait(epoll_fd, &event, 1, -1);
/* handle events */
if(event.data.fd == event_fd && event.events & EPOLLIN) {
uint64_t val = 0;
eventfd_read(event_fd, &val);
break;
}
}
return NULL;
}
int main(void)
{
epoll_fd = epoll_create1(0);
event_fd = eventfd(0, EFD_SEMAPHORE| EFD_NONBLOCK);
struct epoll_event event;
event.events = EPOLLIN;
event.data.fd = event_fd;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);
enum { THREADS = 4 };
pthread_t thrd[THREADS];
for (int i = 0; i < THREADS; i++)
pthread_create(&thrd[i], NULL, &thread, NULL);
/* let threads park internally (kernel does readiness check before sleeping) */
usleep(100000);
eventfd_write(event_fd, THREADS);
for (int i = 0; i < THREADS; i++)
pthread_join(thrd[i], NULL);
}
epoll provides both edge-triggered and level-triggered modes. In edge-triggered mode, a call to epoll_wait will return only when a new event is enqueued with the epoll object, while in level-triggered mode, epoll_wait will return as long as the condition holds.
As its return value, eventfd() returns a new file descriptor that can be used to refer to the eventfd object. The following values may be bitwise ORed in flags to change the behavior of eventfd(): EFD_CLOEXEC (since Linux 2.6. 27) Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.
The main difference between epoll and select is that in select() the list of file descriptors to wait on only exists for the duration of a single select() call, and the calling task only stays on the sockets' wait queues for the duration of a single call.
When you write to an eventfd
, a function eventfd_signal
is called. It contains the following line which does the wake up:
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
With wake_up_locked_poll
being a macro:
#define wake_up_locked_poll(x, m) \
__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
With __wake_up_locked_key
being defined as:
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
__wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
And finally, __wake_up_common
is being declared as:
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
* number) then we wake all the non-exclusive tasks and one exclusive task.
*
* There are circumstances in which we can try to wake a task which has already
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, int wake_flags, void *key,
wait_queue_entry_t *bookmark)
Note the nr_exclusive
argument and you will see that writing to an eventfd
wakes only one exclusive waiter.
What does exclusive mean? Reading epoll_ctl
man page gives us some insight:
EPOLLEXCLUSIVE (since Linux 4.5):
Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using
EPOLLEXCLUSIVE
, one or more of the epoll file descriptors will receive an event withepoll_wait(2)
.
You do not use EPOLLEXCLUSIVE
when adding your event, but to wait with epoll_wait
every thread has to put itself to a wait queue. Function do_epoll_wait
performs the wait by calling ep_poll
. By following the code you can see that it adds the current thread to a wait queue at line #1903:
__add_wait_queue_exclusive(&ep->wq, &wait);
Which is the explanation for what is going on - epoll waiters are exclusive, so only a single thread is woken up. This behavior has been introduced in v2.6.22-rc1 and the relevant change has been discussed here.
To me this looks like a bug in the eventfd_signal
function: in semaphore mode it should perform a wake-up with nr_exclusive
equal to the value written.
So your options are:
poll
, probably on both eventfd
and epollevenfd_write
4 times (probably the best you can do).If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With