In my app I need to watch a directory for new files. The amount of traffic is very large and there are going to be a minimum of hundreds of new files per second appearing. Currently I'm using a busy loop with this kind of idea:
    import os
    import time

    while True:
        time.sleep(0.2)
        if len(os.listdir('.')) > 0:
            pass  # do stuff
After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead.
I'm trying to use one of the available classes in select to poll my directory, but I'm not sure if it actually works, or if I'm just doing it wrong.
I get an fd for my directory with:
    fd = os.open('.', os.O_DIRECT)
I've then tried several methods to see when the directory changes. As an example, one of the things I tried was:
    poll = select.poll()
    poll.register(fd, select.POLLIN)
    poll.poll() # returns (fd, 1) meaning 'ready to read'
    os.read(fd, 4096) # prints largely gibberish but i can see that i'm pulling the files/folders contained in the directory at least
    poll.poll() # returns (fd, 1) again
    os.read(fd, 4096) # empty string - no more data
Why is poll() acting like there is more information to read? I assumed that it would only do that if something had changed in the directory.
Is what I'm trying to do here even possible?
If not, is there any better alternative to "while True: look for changes"?
FreeBSD, and thus Mac OS X, provides an analog of inotify called kqueue. Type man 2 kqueue on a FreeBSD machine for more information. For kqueue on FreeBSD, PyKQueue is available at http://people.freebsd.org/~dwhite/PyKQueue/; unfortunately it is not actively maintained, so your mileage may vary.
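As a rough illustration of the kqueue approach, here is a minimal sketch. It does not use PyKQueue; it assumes the select.kqueue interface from Python's standard library, which is only present on BSD-derived systems such as FreeBSD and Mac OS X. The names watch_directory and KQUEUE_AVAILABLE are made up for this example.

```python
import os
import select

# True only on platforms whose select module exposes kqueue (BSD, Mac OS X).
KQUEUE_AVAILABLE = hasattr(select, "kqueue")


def watch_directory(path):
    """Yield names of entries that appear in `path` (hypothetical helper).

    Blocks in the kernel between changes instead of sleeping and rescanning.
    """
    if not KQUEUE_AVAILABLE:
        raise RuntimeError("select.kqueue is not available on this platform")

    fd = os.open(path, os.O_RDONLY)
    kq = select.kqueue()
    try:
        # Ask for a notification whenever the directory itself is written to,
        # which happens when entries are added or removed.
        change = select.kevent(
            fd,
            filter=select.KQ_FILTER_VNODE,
            flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
            fflags=select.KQ_NOTE_WRITE,
        )
        kq.control([change], 0)  # register only, do not wait yet

        seen = set(os.listdir(path))
        while True:
            kq.control(None, 1)  # block until the directory changes
            current = set(os.listdir(path))
            for name in sorted(current - seen):
                yield name
            seen = current
    finally:
        kq.close()
        os.close(fd)
```

On a system with kqueue, something like `for name in watch_directory('.'): print(name)` should print each new entry as it shows up. The event only says that the directory changed, so the sketch still calls os.listdir to find out which names are new.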
Why not use a Python wrapper for one of the libraries for monitoring file changes, like gamin or inotify (search for pyinotify, I'm only allowed to post one hyperlink as a new user...) - that's sure to be faster and the low-level stuff is already done at C level for you, using kernel interfaces...
"After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead."
It looks like you are already doing synchronous polling, by checking the state at regular intervals. Don't worry about the time "spent" in sleep; it won't eat CPU time. The call just passes control to the operating system, which wakes the process up after the requested timeout.
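One way to check this yourself is to compare wall-clock time with CPU time around a sleep. The sketch below is only an illustration; measure_sleep is a made-up helper, and it relies on time.process_time, which is documented as not counting time elapsed during sleep.

```python
import time


def measure_sleep(seconds):
    """Return (wall_seconds, cpu_seconds) spent in a single time.sleep call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    time.sleep(seconds)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return wall_used, cpu_used


wall, cpu = measure_sleep(0.2)
print(f"wall clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```

The wall-clock figure should be close to the requested 0.2 seconds while the CPU figure stays near zero, which is why a profiler that reports elapsed time makes the sleep look expensive.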
You could consider an asynchronous event loop, using a library that listens for the filesystem change notifications provided by the operating system, but first consider whether it gives you any real benefit in this particular situation.
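If you stay with plain polling, a small refinement is to remember which names have already been seen, so that each pass reports only the new files rather than just "the directory is not empty". The following is a minimal sketch under that assumption; find_new_files, poll_directory, and the handle callback are hypothetical names, not part of any library.

```python
import os
import time


def find_new_files(path, seen):
    """Return (sorted new names, current snapshot) for one polling pass."""
    current = set(os.listdir(path))
    return sorted(current - seen), current


def poll_directory(path, handle, interval=0.2, max_polls=None):
    """Call handle(names) whenever new entries appear in `path`.

    max_polls limits the number of passes; None means run forever.
    """
    seen = set(os.listdir(path))
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)  # the process is idle here, not burning CPU
        new, seen = find_new_files(path, seen)
        if new:
            handle(new)
        polls += 1
```

For example, `poll_directory('.', print)` would print a list of new names roughly every 0.2 seconds whenever something arrives. With hundreds of files per second, each pass handles a batch, so the cost per file stays small.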