
How to use RegexIterator in PHP

I have yet to find a good example of how to use PHP's RegexIterator to recursively traverse a directory.

As an end result, I want to specify a directory and find all files in it with certain extensions, for example only .html and .php files. Furthermore, I want to filter out folders such as .Trash-0, .Trash-500, etc.

<?php 
$Directory = new RecursiveDirectoryIterator("/var/www/dev/");
$It = new RecursiveIteratorIterator($Directory);
$Regex = new RegexIterator($It,'/^.+\.php$/i',RecursiveRegexIterator::GET_MATCH);

foreach($Regex as $v){
    // GET_MATCH yields the array of regex matches; index 0 is the full path
    echo $v[0]."<br/>";
}
?>

This is what I have so far, but it results in: Fatal error: Uncaught exception 'UnexpectedValueException' with message 'RecursiveDirectoryIterator::__construct(/media/hdmovies1/.Trash-0)

Any suggestions?

Chris asked Jul 23 '10 19:07

3 Answers

There are a couple of ways of going about something like this. I'll give two approaches for you to choose from: quick and dirty, versus longer and less dirty (though it's a Friday night, so we're allowed to go a little bit crazy).

1. Quick (and dirty)

This involves writing a single regular expression (it could be split into several) that filters the collection of files in one fell swoop.

(Only the two commented lines are really important to the concept.)

$directory = new RecursiveDirectoryIterator(__DIR__);
$flattened = new RecursiveIteratorIterator($directory);

// Make sure the path does not contain "/.Trash*" folders and ends with a .php or .html file
$files = new RegexIterator($flattened, '#^(?:[A-Z]:)?(?:/(?!\.Trash)[^/]+)+/[^/]+\.(?:php|html)$#Di');

foreach($files as $file) {
    echo $file . PHP_EOL;
}

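This approach is quick to implement, being just a one-liner, but it has drawbacks: the regex is a pain to decipher, and because it filters the already-flattened list, the iterator still descends into the .Trash folders before their contents are discarded.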
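To see what the pattern lets through, here is a minimal sketch that runs it against a few paths. The sample paths are made up purely for illustration and are not taken from the question.

```php
<?php
// Same pattern as in the quick-and-dirty approach.
$pattern = '#^(?:[A-Z]:)?(?:/(?!\.Trash)[^/]+)+/[^/]+\.(?:php|html)$#Di';

// Hypothetical sample paths, chosen only to exercise the pattern.
$samples = [
    '/var/www/dev/index.php',
    '/var/www/dev/sub/page.HTML',
    '/var/www/dev/.Trash-0/old.php',
    '/var/www/dev/notes.txt',
];

$matches = [];
foreach ($samples as $path) {
    // preg_match() returns 1 on a match, 0 otherwise
    $matches[$path] = preg_match($pattern, $path) === 1;
    echo ($matches[$path] ? 'keep ' : 'skip ') . $path . PHP_EOL;
}
```

The first two paths should be kept (the `i` modifier makes the extension case-insensitive), while the file inside `.Trash-0` and the `.txt` file should be skipped.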

2. Less quick (and less dirty)

A more reusable approach is to create a couple of bespoke filters (using regex, or whatever you like!) to whittle the items available in the initial RecursiveDirectoryIterator down to only those that you want. The following is only one example, written quickly just for you, of extending the RecursiveRegexIterator.

We start with a base class whose main job is to keep hold of the regex that we want to filter with; everything else is deferred back to the RecursiveRegexIterator. Note that the class is abstract since it doesn't actually do anything useful: the actual filtering is done by the two classes which extend this one. Also, it may be called FilesystemRegexFilter, but nothing forces it (at this level) to filter filesystem-related classes (I'd have chosen a better name if I weren't quite so sleepy).

abstract class FilesystemRegexFilter extends RecursiveRegexIterator {
    protected $regex;
    public function __construct(RecursiveIterator $it, $regex) {
        $this->regex = $regex;
        parent::__construct($it, $regex);
    }
}

These two classes are very basic filters, acting on the file name and directory name respectively.

class FilenameFilter extends FilesystemRegexFilter {
    // Filter files against the regex; anything that is not a file passes
    // through so that directories can still be descended into
    public function accept(): bool {
        return ( ! $this->isFile() || preg_match($this->regex, $this->getFilename()) === 1);
    }
}

class DirnameFilter extends FilesystemRegexFilter {
    // Filter directories against the regex; files pass through untouched
    public function accept(): bool {
        return ( ! $this->isDir() || preg_match($this->regex, $this->getFilename()) === 1);
    }
}

To put those into practice, the following iterates recursively over the contents of the directory in which the script resides (feel free to edit this!), filters out the .Trash folders (by making sure that folder names match a specially crafted regex that rejects names starting with .Trash), and accepts only PHP and HTML files.

$directory = new RecursiveDirectoryIterator(__DIR__);
// Filter out ".Trash*" folders
$filter = new DirnameFilter($directory, '/^(?!\.Trash)/');
// Filter PHP/HTML files 
$filter = new FilenameFilter($filter, '/\.(?:php|html)$/');

foreach(new RecursiveIteratorIterator($filter) as $file) {
    echo $file . PHP_EOL;
}

Of particular note is that since our filters are recursive, we can choose to play around with how to iterate over them. For example, we could easily limit ourselves to only scanning up to 2 levels deep (including the starting folder) by doing:

$files = new RecursiveIteratorIterator($filter);
$files->setMaxDepth(1); // Two levels, the parameter is zero-based.
foreach($files as $file) {
    echo $file . PHP_EOL;
}
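To make the depth limit concrete, here is a minimal sketch on a throwaway directory tree. The directory and file names are invented for the demo, and it assumes a Unix-like system with a writable temp directory.

```php
<?php
// Build a throwaway tree: a.php, one/b.php, one/two/c.php (made-up names)
$root = sys_get_temp_dir() . '/maxdepth_demo_' . uniqid();
mkdir($root . '/one/two', 0777, true);
touch($root . '/a.php');
touch($root . '/one/b.php');
touch($root . '/one/two/c.php');

$directory = new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS);
$files = new RecursiveIteratorIterator($directory);
$files->setMaxDepth(1); // depth 0 is $root itself, depth 1 its direct subfolders

$seen = [];
foreach ($files as $file) {
    if (! $file->isFile()) {
        continue; // keep only regular files in the listing
    }
    // Record paths relative to $root so the result is easy to read
    $seen[] = substr($file->getPathname(), strlen($root) + 1);
}
sort($seen);
print_r($seen);

// Remove the throwaway tree again
unlink($root . '/a.php');
unlink($root . '/one/b.php');
unlink($root . '/one/two/c.php');
rmdir($root . '/one/two');
rmdir($root . '/one');
rmdir($root);
```

With a maximum depth of 1, `a.php` and `one/b.php` should be listed, while `one/two/c.php` sits one level too deep and is never reached.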

It is also super-easy to add yet more filters (by instantiating more of our filtering classes with different regexes; or, by creating new filtering classes) for more specialised filtering needs (e.g. file size, full-path length, etc.).
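As one possible illustration of a more specialised filter, the sketch below filters by file size. It swaps in PHP's built-in RecursiveCallbackFilterIterator (PHP 5.4+) rather than another regex-based class, and the size limit, directory and file names are all assumptions made up for the demo.

```php
<?php
// Throwaway tree with one small and one large file (made-up names)
$root = sys_get_temp_dir() . '/sizefilter_demo_' . uniqid();
mkdir($root . '/sub', 0777, true);
file_put_contents($root . '/small.php', str_repeat('x', 10));
file_put_contents($root . '/sub/large.php', str_repeat('x', 5000));

$maxBytes = 1024; // hypothetical size limit

$directory = new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS);
$sizeFilter = new RecursiveCallbackFilterIterator(
    $directory,
    function (SplFileInfo $current) use ($maxBytes) {
        // Always accept directories so the recursion can continue into them;
        // accept files only when they are within the size limit.
        return $current->isDir() || $current->getSize() <= $maxBytes;
    }
);

$kept = [];
foreach (new RecursiveIteratorIterator($sizeFilter) as $file) {
    $kept[] = $file->getFilename();
}
print_r($kept);

// Remove the throwaway tree again
unlink($root . '/small.php');
unlink($root . '/sub/large.php');
rmdir($root . '/sub');
rmdir($root);
```

Because the callback filter is itself a RecursiveIterator, it could wrap an existing chain of filters instead of the bare RecursiveDirectoryIterator.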

P.S. Hmm this answer babbles a bit; I tried to keep it as concise as possible (even removing vast swathes of super-babble). Apologies if the net result leaves the answer incoherent.

salathe answered Oct 21 '22 22:10


The docs are indeed not very helpful. There's a problem using a regex for 'does not match' here, but we'll illustrate a working example first:

<?php 
//we want to iterate a directory
$Directory = new RecursiveDirectoryIterator("/var/dir");

//We want to stop descending into directories named '.Trash-[0-9]+'.
//RecursiveRegexIterator requires a RecursiveIterator, so it wraps
//$Directory directly rather than a RecursiveIteratorIterator
$Regex1    = new RecursiveRegexIterator($Directory,'%([^0-9]|^)(?<!/.Trash-)[0-9]*$%');

//But, still continue on doing it **recursively**
$It2       = new RecursiveIteratorIterator($Regex1); 

//Now, match files
$Regex2    = new RegexIterator($It2,'/\.php$/i');
foreach($Regex2 as $v){
  echo $v."\n";
}
?>

The problem is the 'does not match .Trash[0-9]{3}' part: the only way I know to negatively match the directory is to match the end of the string ($) and then assert with a lookbehind, (?<!/foo), that it is not preceded by '/foo'.

However, as .Trash[0-9]{1,3} is not fixed length, we cannot use it as a lookbehind assertion. Unfortunately, there is no 'invert match' for a RegexIterator. But perhaps there are more regex-savvy people than I who know how to match 'any string not ending with .Trash[0-9]+'.


Edit: got it. '%([^0-9]|^)(?<!/.Trash-)[0-9]*$%' as a regex does the trick: the fixed-length lookbehind checks for '/.Trash-' just before the trailing run of digits.
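A minimal sketch to check that pattern against a few paths. The sample paths are invented for illustration only.

```php
<?php
// Pattern from the edit above.
$pattern = '%([^0-9]|^)(?<!/.Trash-)[0-9]*$%';

// Hypothetical sample paths, chosen only to exercise the pattern.
$samples = [
    '/var/dir/.Trash-0',
    '/var/dir/.Trash-500',
    '/var/dir/src',
    '/var/dir/release-2',
    '/var/dir/index.php',
];

$accepted = [];
foreach ($samples as $path) {
    // A match means the iterator would accept (and descend into) the entry
    $accepted[$path] = preg_match($pattern, $path) === 1;
    echo ($accepted[$path] ? 'accept ' : 'reject ') . $path . PHP_EOL;
}
```

Only the two `.Trash-<digits>` paths should be rejected. A name such as `release-2` still passes, because the eight characters before its trailing digits are not `/.Trash-`.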

Wrikken answered Oct 21 '22 22:10


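An improvement on salathe's answer would be to forget about the custom abstract class and directly extend RecursiveRegexIterator instead. Note that parent::accept() matches the regex against the iterator's current value, which for a RecursiveDirectoryIterator with default flags is the full path rather than just the file name, so patterns anchored to the start of the name (such as /^(?!\.Trash)/) would need adjusting.

Here is the file filter: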

class FilenameFilter extends RecursiveRegexIterator
{
    // Filter files against the regex
    public function accept(): bool
    {
        return ! $this->isFile() || parent::accept();
    }
}

And the directory filter:

class DirnameFilter extends RecursiveRegexIterator
{
    // Filter directories against the regex
    public function accept(): bool
    {
        return ! $this->isDir() || parent::accept();
    }
}
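
One thing to watch with this variant is which string the regex is matched against. The sketch below is an illustration on a throwaway tree. The directory names, the `listFiles` helper and the second pattern are all made up for the demo, and it assumes a Unix-like system with default RecursiveDirectoryIterator flags.

```php
<?php
class DirnameFilter extends RecursiveRegexIterator
{
    // Filter directories against the regex; files pass through
    public function accept(): bool
    {
        return ! $this->isDir() || parent::accept();
    }
}

// Throwaway tree: .Trash-0/old.php and src/index.php (made-up names)
$root = sys_get_temp_dir() . '/dirfilter_demo_' . uniqid();
mkdir($root . '/.Trash-0', 0777, true);
mkdir($root . '/src');
touch($root . '/.Trash-0/old.php');
touch($root . '/src/index.php');

// Hypothetical helper: list the file names that survive a given directory regex
function listFiles(string $root, string $regex): array
{
    $directory = new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS);
    $names = [];
    foreach (new RecursiveIteratorIterator(new DirnameFilter($directory, $regex)) as $file) {
        $names[] = $file->getFilename();
    }
    sort($names);
    return $names;
}

// Anchored to the start of the subject, which here is the full path
$byName = listFiles($root, '/^(?!\.Trash)/');
// Inspects the last path segment instead
$byPath = listFiles($root, '#^(?!.*/\.Trash[^/]*$)#');

print_r($byName);
print_r($byPath);

// Remove the throwaway tree again
unlink($root . '/.Trash-0/old.php');
unlink($root . '/src/index.php');
rmdir($root . '/.Trash-0');
rmdir($root . '/src');
rmdir($root);
```

The name-anchored pattern should let both files through, since the full path never starts with `.Trash`, while the path-aware pattern should leave only `index.php`.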

Guillermo answered Oct 21 '22 22:10