
Missing inotify events (in .git directory)

I'm watching files for changes using inotify events (as it happens, from Python, calling into libc).

For some files during a git clone, I see something odd: I get an IN_CREATE event, and ls shows that the file has content, but I never see IN_MODIFY or IN_CLOSE_WRITE. This causes problems because I would like to respond to IN_CLOSE_WRITE on the files, specifically to initiate an upload of the file contents.

The files that behave oddly are in the .git/objects/pack directory, and they end in .pack or .idx. Other files that git creates have a more regular IN_CREATE -> IN_MODIFY -> IN_CLOSE_WRITE chain (I'm not watching for IN_OPEN events).

This is inside Docker on macOS, but I have seen evidence of the same on Docker on Linux on a remote system, so I suspect the macOS aspect is not relevant. I see this when the watcher and git clone run in the same Docker container.

My questions:

  • Why are these events missing on these files?

  • What can be done about it? Specifically, how can I respond to the completion of writes to these files? Note: ideally I would like to respond when writing is "finished" to avoid needlessly/(incorrectly) uploading "unfinished" writing.


Edit: Reading https://developer.ibm.com/tutorials/l-inotify/ it looks like what I'm seeing is consistent with

  • a separate temporary file, with a name like tmp_pack_hBV4Alz, being created, modified, and closed;
  • a hard link to this file being created, with the final .pack name;
  • the original tmp_pack_hBV4Alz name being deleted.

I think my problem, which is trying to use inotify as a trigger to upload files, then reduces to noticing that the .pack file is a hard link to another file, and uploading in this case?
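
The sequence in the list above can be reproduced outside of git with a small sketch. The function name publish_via_hardlink and the file names are hypothetical, and this only imitates the pattern I think I'm seeing; it is not taken from git's source:

```python
import os
import tempfile


def publish_via_hardlink(directory, final_name, data):
    """Write data under a temp name, hard-link it to final_name, drop the temp name.

    Returns (inode_of_temp, inode_of_final, link_count_after_unlink).
    """
    fd, tmp_path = tempfile.mkstemp(prefix="tmp_pack_", dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)  # writes and the close happen under the temp name only
    tmp_inode = os.stat(tmp_path).st_ino

    final_path = os.path.join(directory, final_name)
    os.link(tmp_path, final_path)  # the final name just "appears" in the directory
    os.unlink(tmp_path)            # the temp name goes away again

    st = os.stat(final_path)
    return tmp_inode, st.st_ino, st.st_nlink
```

The final name refers to the same inode as the temporary one, and once the temporary name is removed the link count is back to 1, so checking the link count only helps while both names still exist.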

Michal Charemza asked Jan 22 '20 16:01


5 Answers

To answer your question separately for git 2.24.1 on Linux 4.19.95:

  • Why are these events missing on these files?

You don't see IN_MODIFY/IN_CLOSE_WRITE events because git clone will always try to use hard links for files under the .git/objects directory. When cloning over the network or across file system boundaries, these events will appear again.

  • What can be done about it? Specifically, how can I respond to the completion of writes to these files? Note: ideally I would like to respond when writing is "finished" to avoid needlessly/(incorrectly) uploading "unfinished" writing.

In order to catch modifications to hard-linked files, you have to set up a handler for the inotify IN_CREATE event that follows and keeps track of those links. Please note that a plain IN_CREATE can also mean that a nonempty file was created. Then, on IN_MODIFY/IN_CLOSE_WRITE for any of the files, you have to trigger the same action on all linked files as well. You also have to remove that relationship on the IN_DELETE event.
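
A minimal sketch of that bookkeeping, assuming your event loop already turns inotify events into full paths. The class LinkTracker and its method names are made up for illustration; it groups paths by (device, inode), which is what hard links share:

```python
import os
from collections import defaultdict


class LinkTracker:
    """Group watched paths by (device, inode) so hard links share one identity."""

    def __init__(self):
        self._paths_by_inode = defaultdict(set)
        self._inode_by_path = {}

    def on_create(self, path):
        """Call on the create event. Returns True if the file already has content."""
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        self._paths_by_inode[key].add(path)
        self._inode_by_path[path] = key
        return st.st_size > 0

    def on_delete(self, path):
        """Call on the delete event; the path is gone, so use the stored identity."""
        key = self._inode_by_path.pop(path, None)
        if key is not None:
            self._paths_by_inode[key].discard(path)
            if not self._paths_by_inode[key]:
                del self._paths_by_inode[key]

    def linked_paths(self, path):
        """Call on modify/close-write: every known name for the same file."""
        key = self._inode_by_path.get(path)
        return set(self._paths_by_inode.get(key, ())) if key else set()
```

When on_create returns True, the new name already has content, which is the case where no later IN_CLOSE_WRITE will arrive for it.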

A simpler and more robust approach would probably be to just periodically hash all the files and check if the content of a file has changed.
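
Roughly, the polling approach could look like the sketch below. The function names snapshot and changed_paths are hypothetical, and SHA-256 is just one reasonable choice of hash:

```python
import hashlib
import os


def snapshot(root):
    """Map each file under root to the SHA-256 hex digest of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            try:
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
            except OSError:
                continue  # file vanished or is unreadable; pick it up next round
            digests[path] = h.hexdigest()
    return digests


def changed_paths(previous, current):
    """Paths that are new or whose digest differs from the previous snapshot."""
    return sorted(p for p, d in current.items() if previous.get(p) != d)
```

You would call snapshot on a timer, pass the previous and current results to changed_paths, and upload whatever it returns. A file that is still being written may show up in two consecutive rounds.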


Correction

After checking the git source code closely and running git with strace, I found that git does use memory-mapped files, but mostly for reading content. See the usage of xmmap, which is always called with PROT_READ only. Therefore my previous answer below is NOT the correct answer. Nevertheless, for informational purposes I would still like to keep it here:

  • You don't see IN_MODIFY events because packfile.c uses mmap for file access and inotify does not report modifications for mmaped files.

    From the inotify manpage:

    The inotify API does not report file accesses and modifications that may occur because of mmap(2), msync(2), and munmap(2).

Ente answered Oct 03 '22 06:10


There is another possibility (from man inotify):

Note that the event queue can overflow. In this case, events are lost. Robust applications should handle the possibility of lost events gracefully. For example, it may be necessary to rebuild part or all of the application cache. (One simple, but possibly expensive, approach is to close the inotify file descriptor, empty the cache, create a new inotify file descriptor, and then re-create watches and cache entries for the objects to be monitored.)

Since git clone can generate a heavy event flow, this can happen.

How to avoid this:

  1. Increase the read buffer; try fcntl(F_SETPIPE_SZ) (this approach is a guess, I've never tried it).
  2. Read events into a big buffer in a dedicated thread, and process the events in another thread.
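
A rough sketch of the second option, assuming fd is the inotify file descriptor you already have. The names start_reader and drain are made up, and I'm using a plain queue.Queue for the hand-off:

```python
import os
import queue
import threading


def start_reader(fd, out_queue, bufsize=1 << 20):
    """Drain fd in a background thread, handing raw chunks to out_queue.

    A None sentinel is queued when the read returns nothing or fails.
    """
    def run():
        while True:
            try:
                chunk = os.read(fd, bufsize)
            except OSError:
                chunk = b""
            if not chunk:
                out_queue.put(None)
                return
            out_queue.put(chunk)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def drain(out_queue):
    """Consumer side: collect chunks until the sentinel arrives."""
    chunks = []
    while True:
        chunk = out_queue.get()
        if chunk is None:
            return chunks
        chunks.append(chunk)
```

The reader does nothing but read, so slow processing in the consumer does not delay emptying the kernel queue. Note that the kernel-side limit is set by /proc/sys/fs/inotify/max_queued_events, and an overflow is reported as an event with IN_Q_OVERFLOW set in its mask; F_SETPIPE_SZ is documented for pipes, so it may not apply to an inotify descriptor.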

Yury Nevinitsin answered Oct 03 '22 05:10


I may speculate that Git most of the time uses atomic file updates which are done like this:

  1. A file's contents are read into memory (and modified).
  2. The modified contents are written into a separate file (usually located in the same directory as the original one, and having a randomized (mktemp-style) name).
  3. The new file is then rename(2)-d over the original one; this operation guarantees that every observer trying to open the file using its name will get either the old contents or the new ones.

Such updates are seen by inotify(7) as moved_to (IN_MOVED_TO) events, since a file "reappears" in a directory.
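
For illustration, the steps above can be sketched in Python like this. The function name atomic_write is hypothetical, and this is the generic write-then-rename pattern, not code lifted from Git:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of path by writing a temp file and renaming it over."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # rename(2): atomic within one filesystem
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A watcher on the directory would see the writes and the close under the temporary name only; the target name shows up through the rename.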

kostix answered Oct 03 '22 05:10


Based on this accepted answer I'd assume there might be some difference in the events based on the protocol being used (i.e. ssh or https).

Do you observe the same behavior when monitoring cloning from the local filesystem with the --no-hardlinks option?

$ git clone [email protected]:user/repo.git
# set up watcher for new dir
$ git clone --no-hardlinks repo new-repo

Since you observed this behavior on both Linux and Mac hosts, this open issue is probably not the cause: https://github.com/docker/for-mac/issues/896. I'm adding it just in case.

deric4 answered Oct 03 '22 04:10


Maybe you made the same mistake I made years ago. I've only used inotify twice. The first time, my code simply worked. Later, I no longer had that source and started again, but this time, I was missing events and did not know why.

It turns out that when I was reading an event, I was really reading a small batch of events. I parsed the one I expected, thinking that was it, that was all. Eventually, I discovered there is more to that received data, and when I added a little code to parse all events received from a single read, no more events were lost.
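
Something along these lines is what I mean by parsing everything from one read. It is a sketch that assumes the struct inotify_event layout from the man page (wd, mask, cookie, len, then a NUL-padded name); parse_events is just a name I picked:

```python
import os
import struct

_HEADER = struct.Struct("iIII")  # wd, mask, cookie, len (struct inotify_event)


def parse_events(buf):
    """Yield (wd, mask, cookie, name) for every event packed into buf."""
    offset = 0
    while offset + _HEADER.size <= len(buf):
        wd, mask, cookie, name_len = _HEADER.unpack_from(buf, offset)
        offset += _HEADER.size
        # The name field is padded with NUL bytes up to name_len.
        name = buf[offset:offset + name_len].rstrip(b"\0")
        offset += name_len
        yield wd, mask, cookie, os.fsdecode(name)
```

The loop keeps going until the buffer is exhausted, so a read that returns several events yields all of them instead of only the first.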

donjuedo answered Oct 03 '22 04:10