 

PHP "glob" and data deduplication?

I have a PHP application that, on each request, checks whether certain files exist on a network share.

I'm using glob for this because I usually know only the beginning of the filename.

I noticed that glob does not return files that are currently opened by any client, so my application thinks file_xy does not exist whenever somebody has it open.

Is there a way to make glob return opened (:= locked?) files as well?

The strange thing is that this is not mentioned anywhere. However, I can confirm that glob is NOT returning files that are currently opened by a client. (As soon as the client closes the accessing application, glob returns the file as usual.)


PS: not even glob("\\server\share\*") returns the file as long as it is opened. (The network share allows the maximum number of concurrent users.)


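    // Single quotes keep the UNC path free of escape-sequence surprises.
    $dir = opendir('\\\\server\\share');
    // Compare against false so an entry named "0" does not end the loop early.
    while (($file = readdir($dir)) !== false) {
      echo $file . "<br />";
    }
    closedir($dir);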

shows the file in question perfectly fine, whether or not it is opened by another client. So I can almost rule out any access-limit or permission issue.


I have figured out the trigger, even though I do not yet know the underlying reason:

The issue of glob() not finding an opened file appears when the file is located on a drive that uses the built-in data deduplication feature of Windows Server 2012 R2.

If I move the file to a non-deduplicated share, glob() finds it, even when it is opened by multiple clients.


Since I have a working alternative, this question should mainly focus on why glob does not work here, or rather works differently. There has to be a difference in how glob and readdir access the underlying filesystem to determine the directory contents.
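For reference, the opendir/readdir alternative can be wrapped into a prefix lookup that stands in for the glob call. This is only a minimal sketch: findByPrefix is a hypothetical helper name, the match is case-sensitive, and it assumes that readdir keeps listing opened files as observed above.

```php
<?php
/**
 * Return the names of all entries in $dir whose name starts with $prefix,
 * using opendir()/readdir() instead of glob().
 * Hypothetical helper; returns an empty array if the directory cannot be opened.
 */
function findByPrefix(string $dir, string $prefix): array
{
    $handle = @opendir($dir);
    if ($handle === false) {
        return [];
    }

    $matches = [];
    // Strict comparison with false: an entry named "0" must not end the loop.
    while (($entry = readdir($handle)) !== false) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        // Case-sensitive prefix test on the first strlen($prefix) bytes.
        if (strncmp($entry, $prefix, strlen($prefix)) === 0) {
            $matches[] = $entry;
        }
    }
    closedir($handle);

    sort($matches);
    return $matches;
}
```

A call such as findByPrefix('\\\\server\\share', 'file_xy') would then take the place of glob('\\\\server\\share\\file_xy*'), returning bare file names rather than full paths.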


Another Proof

There is further evidence that this relates to data deduplication: I configured the feature to deduplicate only files older than 3 days.

I set up a cron job that opens and globs a certain file on the share. Once the file was about 3 days old (Windows decides when to deduplicate), glob failed to list it while it was opened by another client.

Thus, glob is able to find an open file that was copied to the share within the last 3 days, and then starts to miss it once it has been deduplicated.

Observations

glob

glob fails, causing this post :-)

scandir

Using the mentioned scandir function shows the very same behavior:

  • deduplicated file opened by a client - missing in the resulting array.
  • deduplicated file not opened by a client - part of the resulting array.

opendir / readdir

I want to stress again that opendir together with readdir works in both cases.

RecursiveDirectoryIterator

This also produced the expected result in every case.
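To reproduce these observations side by side, a small diagnostic along the following lines could run all four listing methods against the same directory. compareListings is a hypothetical name, and the sketch assumes the directory exists and is readable. Note that glob('*') normally skips dot-files, so those may show up as differences regardless of deduplication.

```php
<?php
/**
 * List the direct children of $dir with glob, scandir, readdir and
 * RecursiveDirectoryIterator, and return one sorted name list per method.
 * Hypothetical diagnostic helper; assumes $dir exists and is readable.
 */
function compareListings(string $dir): array
{
    // Drop "." and "..", then sort so the lists can be compared directly.
    $normalize = static function (array $names): array {
        $names = array_values(array_diff($names, ['.', '..']));
        sort($names);
        return $names;
    };

    // glob returns full paths, so reduce them to bare file names.
    $pattern = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . '*';
    $viaGlob = array_map('basename', glob($pattern) ?: []);

    $viaScandir = scandir($dir) ?: [];

    $viaReaddir = [];
    $handle = opendir($dir);
    if ($handle !== false) {
        while (($entry = readdir($handle)) !== false) {
            $viaReaddir[] = $entry;
        }
        closedir($handle);
    }

    // Iterating without RecursiveIteratorIterator stays on the top level.
    $viaIterator = [];
    $iterator = new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS);
    foreach ($iterator as $info) {
        $viaIterator[] = $info->getFilename();
    }

    return [
        'glob'     => $normalize($viaGlob),
        'scandir'  => $normalize($viaScandir),
        'readdir'  => $normalize($viaReaddir),
        'iterator' => $normalize($viaIterator),
    ];
}
```

With the result in $r, array_diff($r['readdir'], $r['glob']) gives the names that readdir reports but glob misses.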

File Attributes

I noted that deduplicated files are shown with a "Size on disk" of 0 bytes, while files that are not yet deduplicated (and are found successfully) are shown with the size they logically occupy (based on the filesystem's cluster size):

However, this would not explain why it makes a difference whether a file is opened by a client or not. The reported size is the same at all times.

File Attributes of deduplicated and not deduplicated file

dognose asked Jun 21 '15 21:06



3 Answers

I'm not sure if this is what you're looking for, but I use scandir() to list all the files in a directory; you can then execute any command on them once you know the name. It will work on open files as well.

PHP scandir documentation source

AfikDeri answered Nov 16 '22 11:11



This makes some sense: if the intent of deduplication is not to have duplicates, then the files are being locked and PHP can't see them. The only thing to do is to see if this limitation applies to scandir() and the SPL directory/filesystem family of iterators as well. If so, it may not be possible to get a list of them.

The only other choice would be to use exec() and a Windows command-line hack to see if you can get a list of files and then parse the output. This may be useful:

php exec: does not return output
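The exec() idea could look roughly like the sketch below. It relies on the cmd.exe built-in dir with /b (bare names) and /a-d (everything except directories). listViaDir and filterByPrefix are hypothetical helper names, and whether dir actually lists an opened, deduplicated file is untested here.

```php
<?php
/**
 * List the files in $dir by shelling out to the Windows "dir" command.
 * Hypothetical helper: returns null on non-Windows systems and an empty
 * array if the command fails or finds nothing.
 */
function listViaDir(string $dir): ?array
{
    if (PHP_OS_FAMILY !== 'Windows') {
        return null;
    }

    // A trailing backslash directly before the closing quote can confuse
    // cmd.exe, so strip it before quoting the path.
    $cmd = 'dir /b /a-d ' . escapeshellarg(rtrim($dir, '\\/')) . ' 2>NUL';

    $lines = [];
    $exitCode = 0;
    exec($cmd, $lines, $exitCode);

    return $exitCode === 0 ? $lines : [];
}

/**
 * Keep only the names that start with $prefix. The comparison ignores
 * case because Windows file names are case-insensitive.
 */
function filterByPrefix(array $names, string $prefix): array
{
    $startsWith = static function (string $name) use ($prefix): bool {
        return strncasecmp($name, $prefix, strlen($prefix)) === 0;
    };

    return array_values(array_filter($names, $startsWith));
}
```

Something like filterByPrefix(listViaDir('\\\\server\\share') ?? [], 'file_xy') would then play the role of the original glob call.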

Good luck!

ArtisticPhoenix answered Nov 16 '22 11:11



Did you try this?

    $files = glob('{,.}*', GLOB_BRACE);

It might be possible that the data deduplication feature is keeping the opened file as a hidden file.

Rajat Garg answered Nov 16 '22 13:11
