I have seen a lot of comparisons saying that select has to walk through the fd list, and that this is slow. But why doesn't epoll have to do this?
poll() is more efficient than select() for large-valued file descriptors. Imagine watching a single file descriptor with the value 900 via select(): the kernel would have to check each bit of each passed-in set, up to the 900th bit.
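To make the bit-scanning concrete, here is a small Python sketch that forces select() to scan up to bit 900. It assumes the process's open-file limit allows descriptor 900 (the default soft limit on most Linux systems is 1024 or higher):

```python
import os
import select

r, w = os.pipe()
fd900 = os.dup2(r, 900)   # duplicate the read end onto descriptor 900
os.write(w, b"!")         # make it readable

# To discover this single fd, the kernel must scan every bit of the
# passed-in set from 0 up to 900.
ready, _, _ = select.select([fd900], [], [], 1)
print(ready)              # → [900]

os.close(fd900)
os.close(r)
os.close(w)
```

poll() avoids this by taking an array of `struct pollfd` entries, so the kernel only examines the descriptors you actually listed, regardless of their numeric values.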
Both select() and poll() potentially trigger on a single event. However, with poll, the user then has no choice but to iterate through the entire submitted list to find the events, whereas with epoll you get back a list containing only the actual events.
epoll stands for event poll and is a Linux-specific construct. It allows a process to monitor multiple file descriptors and get notifications when I/O is possible on them. It supports both edge-triggered and level-triggered notifications.
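The edge-triggered versus level-triggered distinction can be observed directly from Python's `select` module (Linux-only, since it wraps the epoll API). This sketch registers the same readable pipe with two epoll instances, one in each mode:

```python
import os
import select

r, w = os.pipe()
os.write(w, b"x")                     # data is pending on r

# Level-triggered (the default): the event is reported on every
# wait, as long as unread data remains.
lt = select.epoll()
lt.register(r, select.EPOLLIN)
lt_first, lt_second = lt.poll(0), lt.poll(0)

# Edge-triggered (EPOLLET): the event is reported once, when
# readiness changes, and not again until new data arrives.
et = select.epoll()
et.register(r, select.EPOLLIN | select.EPOLLET)
et_first, et_second = et.poll(0), et.poll(0)

print(len(lt_first), len(lt_second))  # 1 1
print(len(et_first), len(et_second))  # 1 0

lt.close()
et.close()
os.close(r)
os.close(w)
```

This is why edge-triggered code must drain the fd (read until EAGAIN) before waiting again, while level-triggered code can read partially and rely on the next wait to re-report the fd.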
Unlike select and poll, each of which is a single API, epoll is a group of three APIs: epoll_create sets up the epoll instance, epoll_ctl adds (or removes) file descriptors to watch, and epoll_wait can be called in a loop to wait on the fds that were added. The complexity of the inner loop is O(ready fds).
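The three calls can be exercised through Python's `select.epoll` wrapper (Linux-only); this minimal sketch uses a pipe as a stand-in for a socket:

```python
import os
import select

r, w = os.pipe()                  # a readable fd to monitor

ep = select.epoll()               # wraps epoll_create
ep.register(r, select.EPOLLIN)    # wraps epoll_ctl(EPOLL_CTL_ADD)

os.write(w, b"hi")                # make the read end ready

# Wraps epoll_wait: it returns ONLY the ready fds, so the loop body
# is O(number of ready fds), not O(number of registered fds).
events = ep.poll(timeout=1)
print(events)                     # [(r, select.EPOLLIN)]

ep.unregister(r)                  # wraps epoll_ctl(EPOLL_CTL_DEL)
ep.close()
os.close(r)
os.close(w)
```

In a real server, `register` is called once per accepted connection, and only `ep.poll` runs in the hot loop.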
There's a lot of misinformation about this, but the real reason is this:
A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.
With select, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.
By contrast, with epoll, the epoll socket itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.
To be clear, with epoll, the epoll socket itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And it is torn down once, for each connection, when the connection is removed. By contrast, each call to select that blocks must add the process to every wait queue for every socket being monitored.
Ironically, with select, the largest cost comes from checking whether sockets that have had no activity have had any activity. With epoll, there is no need to check sockets that have had no activity, because if they did have activity, they would have informed the epoll socket when that activity happened. In a sense, select polls each socket each time you call select to see if there's any activity, while epoll rigs it so that the socket activity itself notifies the process.
The main difference between epoll and select is that in select() the list of file descriptors to wait on only exists for the duration of a single select() call, and the calling task only stays on the sockets' wait queues for the duration of a single call. In epoll, on the other hand, you create a single file descriptor that aggregates events from multiple other file descriptors you want to wait on, so the list of monitored fd's is long-lasting, and tasks stay on socket wait queues across multiple system calls. Furthermore, since an epoll fd can be shared across multiple tasks, it is no longer a single task on the wait queue, but a structure that itself contains another wait queue, containing all processes currently waiting on the epoll fd. (In terms of implementation, this is abstracted over by the sockets' wait queues holding a function pointer and a void* data pointer to pass to that function.)
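This per-call versus long-lasting distinction is visible even through Python's `select` module (the epoll part is Linux-only): select() is handed the full interest list on every call, while an epoll object keeps its registrations across calls:

```python
import os
import select

r1, w1 = os.pipe()
r2, w2 = os.pipe()
os.write(w1, b"x")                        # only r1 has pending data

# select(): the whole interest list is passed to the kernel
# anew on every single call.
ready, _, _ = select.select([r1, r2], [], [], 0)
print(ready)                              # [r1]

# epoll: fds are registered once and stay registered; each
# ep.poll (epoll_wait) reuses the long-lived interest list.
ep = select.epoll()
ep.register(r1, select.EPOLLIN)
ep.register(r2, select.EPOLLIN)
epoll_ready = [fd for fd, _ in ep.poll(0)]
print(epoll_ready)                        # [r1]

ep.close()
for fd in (r1, w1, r2, w2):
    os.close(fd)
```

Both calls report only r1 as ready, but only select() required re-submitting, and re-scanning, the entire fd list to find that out.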
So, to explain the mechanics a little more:
- An epoll file descriptor has a private struct eventpoll that keeps track of which fd's are attached to this fd. struct eventpoll also has a wait queue that keeps track of all processes that are currently epoll_wait-ing on this fd. struct eventpoll also has a list of all file descriptors that are currently available for reading or writing.
- When you add a file descriptor to an epoll fd using epoll_ctl(), epoll adds the struct eventpoll to that fd's wait queue. It also checks if the fd is currently ready for processing and adds it to the ready list, if so.
- When a process waits on an epoll fd using epoll_wait, the kernel first checks the ready list, and returns immediately if any file descriptors are already ready. If not, it adds itself to the single wait queue inside struct eventpoll, and goes to sleep.
- When an event occurs on a socket that is being epoll()ed, it calls the epoll callback, which adds the file descriptor to the ready list, and also wakes up any waiters that are currently waiting on that struct eventpoll.
- Obviously, a lot of careful locking is needed on struct eventpoll and the various lists and wait queues, but that's an implementation detail.
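The bookkeeping described above can be sketched as a toy model. This is illustrative Python with hypothetical names, not kernel code, and it ignores the locking entirely:

```python
from collections import deque


class ToyEventPoll:
    """Toy model of the kernel's struct eventpoll (illustrative only)."""

    def __init__(self):
        self.interest = set()   # fds attached via epoll_ctl(EPOLL_CTL_ADD)
        self.ready = deque()    # fds with pending events (the ready list)
        self.waiters = []       # tasks blocked in epoll_wait

    def ctl_add(self, fd, ready_now=False):
        # epoll_ctl: note the fd in the interest set; the attachment of
        # this structure to the fd's own wait queue is modelled implicitly.
        self.interest.add(fd)
        if ready_now:           # fd already ready at registration time
            self.ready.append(fd)

    def on_activity(self, fd):
        # The callback a socket's wait queue invokes when activity
        # happens: push the fd onto the ready list and wake waiters.
        if fd in self.interest and fd not in self.ready:
            self.ready.append(fd)
        woken, self.waiters = self.waiters, []
        return woken            # the tasks that would be woken up

    def wait(self):
        # epoll_wait: O(len(ready)), never scans the interest set.
        out = list(self.ready)
        self.ready.clear()
        return out


ep = ToyEventPoll()
for fd in (3, 4, 5):
    ep.ctl_add(fd)
ep.on_activity(4)               # activity on fd 4 only
got = ep.wait()
print(got)                      # [4]
```

Note that `wait()` touches only the ready list: fds 3 and 5, which had no activity, are never inspected at all.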
The important thing to note is that at no point above did I describe a step that loops over all file descriptors of interest. By being entirely event-based and by using a long-lasting set of fd's and a ready list, epoll can avoid ever taking O(n) time for an operation, where n is the number of file descriptors being monitored.