
Using select/poll/kqueue/kevent to watch a directory for new files

In my app I need to watch a directory for new files. Traffic is heavy: at least hundreds of new files will appear per second. Currently I'm using a busy loop along these lines:

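import os
import time

while True:
    time.sleep(0.2)  # check the directory five times per second
    if os.listdir('.'):  # a non-empty listing means there are files to handle
        pass  # do stuff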

After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead.

I'm trying to use one of the available classes in select to poll my directory, but I'm not sure if it actually works, or if I'm just doing it wrong.

I get an fd for my directory with:

fd = os.open('.', os.O_DIRECT)

I've then tried several methods to see when the directory changes. As an example, one of the things I tried was:

poll = select.poll()
poll.register(fd, select.POLLIN)

poll.poll()  # returns [(fd, 1)]: event mask 1 is POLLIN, meaning 'ready to read'

os.read(fd, 4096)  # returns mostly gibberish, but I can see the names of the files/folders in the directory

poll.poll()  # returns [(fd, 1)] again

os.read(fd, 4096)  # empty string: no more data

Why is poll() acting like there is more information to read? I assumed that it would only do that if something had changed in the directory.

Is what I'm trying to do here even possible?

If not, is there a better alternative to a while True: loop that looks for changes?

gdm asked Jul 22 '09 14:07


3 Answers

FreeBSD, and thus Mac OS X, provides an analog of inotify called kqueue. Type man 2 kqueue on a FreeBSD machine for more information. For kqueue on FreeBSD there is PyKQueue, available at http://people.freebsd.org/~dwhite/PyKQueue/; unfortunately it is not actively maintained, so your mileage may vary.
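
Python's standard select module has also exposed kqueue and kevent directly since Python 2.6 on platforms that support them, so a separate wrapper may not be needed. What follows is only a minimal sketch under that assumption: the function name wait_for_dir_change is made up for illustration, and the code does nothing useful on systems without kqueue (such as Linux).

```python
import os
import select


def wait_for_dir_change(path, timeout=None):
    """Wait until the directory at `path` is written to, or `timeout` seconds pass.

    Returns True if the kernel reported a change, False on timeout.
    Hypothetical helper; requires select.kqueue (FreeBSD, macOS, other BSDs).
    """
    if not hasattr(select, "kqueue"):
        raise OSError("select.kqueue is not available on this platform")

    fd = os.open(path, os.O_RDONLY)
    try:
        kq = select.kqueue()
        try:
            # Watch the directory's vnode; NOTE_WRITE fires when entries
            # are added to or removed from the directory.
            event = select.kevent(
                fd,
                filter=select.KQ_FILTER_VNODE,
                flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
                fflags=select.KQ_NOTE_WRITE,
            )
            # Register the event and wait for at most one notification.
            events = kq.control([event], 1, timeout)
            return len(events) > 0
        finally:
            kq.close()
    finally:
        os.close(fd)
```

A call such as wait_for_dir_change('.', timeout=5) would replace the sleep in the original loop. Because this sketch opens and closes the kqueue on every call, files that arrive between two calls are not reported as events, so you would still list the directory after each wake-up; a long-running watcher would more likely keep one kqueue open for its whole lifetime.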

Kurt answered Oct 25 '22 18:10


Why not use a Python wrapper for one of the file-monitoring libraries, like gamin or inotify (search for pyinotify; I'm only allowed to post one hyperlink as a new user)? That's sure to be faster than a polling loop, and the low-level work is already done for you in C, using kernel interfaces.
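
To give a rough idea of what such a wrapper does underneath, here is a sketch that calls the Linux inotify interface directly through ctypes. It is not pyinotify's API and not a replacement for it; the helper names open_watch and read_new_names are hypothetical, and it assumes Linux with a libc that exports inotify_init and inotify_add_watch.

```python
import ctypes
import ctypes.util
import os
import struct

# Event mask bits from <sys/inotify.h>.
IN_MOVED_TO = 0x00000080  # a file was renamed into the watched directory
IN_CREATE = 0x00000100    # a file or directory was created in it

# struct inotify_event header: int wd; uint32 mask, cookie, len;
_EVENT_HEADER = struct.Struct("iIII")


def open_watch(path):
    """Return an inotify fd reporting files created in or moved into `path`."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    wd = libc.inotify_add_watch(fd, os.fsencode(path), IN_CREATE | IN_MOVED_TO)
    if wd < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    return fd


def read_new_names(fd):
    """Block until at least one event arrives, then return the new entry names."""
    data = os.read(fd, 65536)
    names = []
    offset = 0
    while offset < len(data):
        _wd, _mask, _cookie, length = _EVENT_HEADER.unpack_from(data, offset)
        offset += _EVENT_HEADER.size
        if length:
            # The name is NUL-padded to `length` bytes.
            raw = data[offset:offset + length]
            names.append(os.fsdecode(raw.rstrip(b"\0")))
        offset += length
    return names
```

In use, you would call open_watch('.') once and then call read_new_names(fd) in a loop; each call sleeps in the kernel until something appears, so there is no fixed polling interval. At hundreds of files per second the kernel's event queue can overflow, which is one reason a maintained wrapper is probably the safer choice.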

David Fraser answered Oct 25 '22 17:10


"After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead."

It looks like you are already doing synchronous polling, by checking the state at regular intervals. Don't worry about the time "spent" in sleep; it won't eat CPU time. Sleeping just passes control to the operating system, which wakes the process up after the requested timeout.
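
If you want to check this on your own machine, a small sketch like the following compares wall-clock time with CPU time across a single sleep. The function name measure_sleep is just for illustration; a profiler that measures wall-clock time would show the first number, while the second is what the process actually costs the CPU.

```python
import time


def measure_sleep(seconds):
    """Return (wall_seconds, cpu_seconds) elapsed across one time.sleep call."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # CPU time used by this process only
    time.sleep(seconds)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = measure_sleep(0.2)
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")
```

The wall-clock figure should come out at roughly the requested 0.2 seconds, while the CPU figure should be close to zero.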

You could consider an asynchronous event loop, using a library that listens for the filesystem change notifications provided by the operating system, but first consider whether it gives you any real benefit in this particular situation.

Adam Byrtek answered Oct 25 '22 16:10