
Problem on Tag "Filesystem"

For recreational reasons I wrote a PHP class that classifies files with tags instead of hierarchically. The tags are stored in the filename itself, in the form +tag1+tag2+tagN+MD5.EXTENSION, so I'm stuck with the 255-character filename limit imposed by the FS/OS. Here is the class:

<?php

class TagFS
{
    public $FS = null;

    function __construct($FS)
    {
        if (is_dir($FS) === true)
        {
            $this->FS = $this->Path($FS);
        }
    }

    function Add($path, $tag)
    {
        if (is_dir($path) === true)
        {
            $files = array_slice(scandir($path), 2);

            foreach ($files as $file)
            {
                $this->Add($this->Path($path) . $file, $tag);
            }

            return true;
        }

        else if (is_file($path) === true)
        {
            $file = md5_file($path);

            if (is_file($this->FS . $file) === false)
            {
                if (copy($path, $this->FS . $file) === false)
                {
                    return false;
                }
            }

            return $this->Link($this->FS . $file, $this->FS . '+' . $this->Tag($tag) . '+' . $file . '.' . strtolower(pathinfo($path, PATHINFO_EXTENSION)));
        }

        return false;
    }

    function Get($tag)
    {
        return glob($this->FS . '*+' . str_replace('+', '{+,+*+}', $this->Tag($tag)) . '+*', GLOB_BRACE);
    }

    function Link($source, $destination)
    {
        if (is_file($source) === true)
        {
            if (function_exists('link') === true)
            {
                return link($source, $destination);
            }

            if (is_file($destination) === false)
            {
                exec('fsutil hardlink create "' . $destination . '" "' . $source . '"');

                if (is_file($destination) === true)
                {
                    return true;
                }
            }
        }

        return false;
    }

    function Path($path)
    {
        if (file_exists($path) === true)
        {
            $path = str_replace('\\', '/', realpath($path));

            if ((is_dir($path) === true) && ($path[strlen($path) - 1] != '/'))
            {
                $path .= '/';
            }

            return $path;
        }

        return false;
    }

    function Tag($string)
    {
        /*
        TODO:
        Remove (on Windows):            . \ / : * ? " < > |
        Remove (on *nix):               . /
        Remove (on TagFS):              + * { }
        Remove (on TagFS - Possibly!)   -
        Max Chars (in Windows)          255
        Max Char (in *nix)              255
        */

        $result = array_filter(array_unique(explode(' ', $string)));

        if (empty($result) === false)
        {
            if (natcasesort($result) === true)
            {
                return strtolower(implode('+', $result));
            }
        }

        return false;
    }
}

?>

I believe this system works well for a couple of short tags, but my problem starts when the whole filename exceeds 255 chars. What approach should I take to get around the filename limit? I'm thinking of splitting the tags across several hard links to the same file, but the permutations may kill the system.
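To make the splitting idea concrete, here is a minimal sketch in Python (the intended port target). The function name `split_tags` is hypothetical, the limit is counted in UTF-8 bytes as a conservative assumption (NTFS counts UTF-16 code units instead), and tags are ordered with a plain `sorted()` rather than PHP's `natcasesort`. It packs the sorted tags greedily into as few `+tag1+tag2+MD5.ext` names as fit, so no permutations are generated.

```python
MAX_NAME = 255  # assumed per-filename limit, counted in UTF-8 bytes


def split_tags(tags, digest, ext, limit=MAX_NAME):
    """Pack sorted tags into as few '+t1+t2+DIGEST.ext' names as fit the limit."""
    suffix = "+" + digest + ("." + ext if ext else "")
    budget = limit - len(suffix.encode("utf-8"))  # bytes left for the tag part
    names, current, used = [], [], 0
    for tag in sorted({t.lower() for t in tags}):
        cost = len(("+" + tag).encode("utf-8"))
        if cost > budget:
            raise ValueError(f"tag too long for a single filename: {tag!r}")
        if current and used + cost > budget:
            # Current name is full: close it and start the next hard-link name.
            names.append("+" + "+".join(current) + suffix)
            current, used = [], 0
        current.append(tag)
        used += cost
    if current:
        names.append("+" + "+".join(current) + suffix)
    return names
```

Each returned name would become one hard link to the same stored file. The trade-off is that a query for two tags that ended up in different names no longer matches a single glob pattern, so lookups would have to match each tag separately and intersect the results by hash.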

Are there any other ways to solve this problem?

EDIT - Some usage examples:

<?php

$images = new TagFS('S:');

$images->Add('P:/xampplite/htdocs/tag/geoaki.png', 'geoaki logo');
$images->Add('P:/xampplite/htdocs/tag/cloud.jpg', 'geoaki cloud tag');
$images->Add('P:/xampplite/htdocs/tag/cloud.jpg', 'nuvem azul branco');
$images->Add('P:/xampplite/htdocs/tag/xml-full.gif', 'geoaki auto vin api service xml');
$images->Add('P:/xampplite/htdocs/tag/dunp3d-1.jpg', 'dunp logo');
$images->Add('P:/xampplite/htdocs/tag/d-proposta-04c.jpg', 'dunp logo');

/*
[0] => S:/+api+auto+geoaki+service+vin+xml+29be189cbc98fcb36a44d77acad13e18.gif
[1] => S:/+azul+branco+nuvem+4151ae7900f33788d0bba5fc6c29bee3.jpg
[2] => S:/+cloud+geoaki+tag+4151ae7900f33788d0bba5fc6c29bee3.jpg
[3] => S:/+dunp+logo+0cedeb6f66cbfc3974c6b7ad86f4fbd3.jpg
[4] => S:/+dunp+logo+8b9fcb119246bb6dcac1906ef964d565.jpg
[5] => S:/+geoaki+logo+5f5174c498ffbfd9ae49975ddfa2f6eb.png
*/
echo '<pre>';
print_r($images->Get('*'));
echo '</pre>';

/*
[0] => S:/+azul+branco+nuvem+4151ae7900f33788d0bba5fc6c29bee3.jpg
*/
echo '<pre>';
print_r($images->Get('azul nuvem'));
echo '</pre>';

/*
[0] => S:/+dunp+logo+0cedeb6f66cbfc3974c6b7ad86f4fbd3.jpg
[1] => S:/+dunp+logo+8b9fcb119246bb6dcac1906ef964d565.jpg
[2] => S:/+geoaki+logo+5f5174c498ffbfd9ae49975ddfa2f6eb.png
*/
echo '<pre>';
print_r($images->Get('logo'));
echo '</pre>';

/*
[0] => S:/+dunp+logo+0cedeb6f66cbfc3974c6b7ad86f4fbd3.jpg
[1] => S:/+dunp+logo+8b9fcb119246bb6dcac1906ef964d565.jpg
*/
echo '<pre>';
print_r($images->Get('logo dunp'));
echo '</pre>';

/*
[0] => S:/+geoaki+logo+5f5174c498ffbfd9ae49975ddfa2f6eb.png
*/
echo '<pre>';
print_r($images->Get('geo* logo'));
echo '</pre>';

?>

EDIT: Because several answers suggest a serverless database or some other kind of lookup table (XML, flat file, key/value pairs, etc.), I want to clarify the following: although this code is written in PHP, the idea is to port it to Python and make a desktop application out of it. Apart from the example, this has nothing to do with PHP. Furthermore, if I have to use some kind of lookup table I'll definitely go with SQLite 3, but what I'm looking for is a solution that doesn't involve any additional "technology" besides the filesystem (folders, files and hard links).

You may call me nuts, but I'm trying to accomplish two simple goals here: 1) keep the system "garbage" free (who likes Thumbs.db or .DS_Store, for example?) and 2) keep the files easily identifiable if for some reason the lookup table (in this case SQLite) gets busy, corrupted, lost or forgotten (in backups, for instance).

PS: This is supposed to run on Linux, Mac, and Windows (under NTFS).

Alix Axel asked May 18 '09 03:05


4 Answers

If you can use hard or soft links, then you might look into giving each tag its own directory, containing a link to every file that has that tag. When you are given multiple tags, you can compare the directories and keep the files found in all of them. The files themselves could then be stored in a single folder, each with a unique name of course.

I don't know how this would differ from having a metadata file named after the tag that lists all files carrying that tag.
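A rough sketch of the directory-per-tag layout in Python, assuming hard links are available and that the store and the tag directories live on the same volume. The function names and the `store` / `tags_root` layout are hypothetical, chosen only for illustration.

```python
import os


def add_tag(store, tags_root, name, tag):
    """Hard-link store/name into tags_root/<tag>/name (hypothetical layout)."""
    tag_dir = os.path.join(tags_root, tag.lower())
    os.makedirs(tag_dir, exist_ok=True)
    link = os.path.join(tag_dir, name)
    if not os.path.exists(link):
        os.link(os.path.join(store, name), link)


def find_all(tags_root, *tags):
    """Return the sorted file names present in every requested tag directory."""
    result = None
    for tag in tags:
        tag_dir = os.path.join(tags_root, tag.lower())
        names = set(os.listdir(tag_dir)) if os.path.isdir(tag_dir) else set()
        # Intersect with what the previous tags matched.
        result = names if result is None else result & names
    return sorted(result or [])
```

With this layout each tag only has to fit in one directory name, so the number of tags per file is no longer bounded by the filename limit.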

he_the_great answered Oct 04 '22 23:10


You may want to create a cache of tags for each folder you're concerned with, similar to the way Windows creates a Thumbs.db file to cache thumbnails when browsing folders.

Creating a metadata file like this has the advantage of working across many different file systems without encountering a file name limitation.
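As an illustration only, such a per-folder metadata file could be a small JSON document. The file name `.tags.json` and the three helper functions below are assumptions made for this sketch, not an existing convention.

```python
import json
import os

INDEX_NAME = ".tags.json"  # hypothetical sidecar file, one per folder


def load_index(folder):
    """Return {filename: [tags]} for a folder, or {} if no index exists yet."""
    try:
        with open(os.path.join(folder, INDEX_NAME), encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def tag_file(folder, name, *tags):
    """Merge tags into the folder's index and rewrite it via a temp file."""
    index = load_index(folder)
    merged = set(index.get(name, [])) | {t.lower() for t in tags}
    index[name] = sorted(merged)
    tmp = os.path.join(folder, INDEX_NAME + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(index, fh, indent=2, sort_keys=True)
    os.replace(tmp, os.path.join(folder, INDEX_NAME))  # swap in the new index


def files_with(folder, *tags):
    """Return the sorted names of files that carry all of the given tags."""
    wanted = {t.lower() for t in tags}
    return sorted(n for n, ts in load_index(folder).items() if wanted <= set(ts))
```

Because the tags live inside the file rather than in a name, neither their number nor their length is limited by the filesystem.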

Joey Robert answered Oct 05 '22 00:10


I would insert that information into a database, even if it's a lightweight one, like an SQLite file in the same directory.
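For reference, a minimal sketch of the SQLite variant using Python's standard `sqlite3` module. The table name `file_tags` and the function names are made up for this example.

```python
import sqlite3


def open_db(path):
    """Open (or create) the tag database with one row per (file, tag) pair."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS file_tags ("
        "name TEXT NOT NULL, tag TEXT NOT NULL, PRIMARY KEY (name, tag))"
    )
    return db


def add_tags(db, name, tags):
    """Attach tags to a file name, ignoring pairs that already exist."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR IGNORE INTO file_tags (name, tag) VALUES (?, ?)",
            [(name, t.lower()) for t in tags],
        )


def find_files(db, tags):
    """Return the names of files that carry every one of the given tags."""
    tags = sorted({t.lower() for t in tags})
    if not tags:
        return []
    marks = ",".join("?" * len(tags))
    rows = db.execute(
        f"SELECT name FROM file_tags WHERE tag IN ({marks}) "
        "GROUP BY name HAVING COUNT(DISTINCT tag) = ? ORDER BY name",
        (*tags, len(tags)),
    )
    return [row[0] for row in rows]
```

Passing `":memory:"` as the path gives a throwaway database, which is handy for trying the queries out.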

If you don't want to do that, you could create hard links to the file without any permutations. One file per tag. Tagging P:/xampplite/htdocs/tag/geoaki.png with geoaki and logo would result in two files both being hard links pointing to the same data as the original file:

  • P:/xampplite/htdocs/tag/geoaki.png.geoaki
  • P:/xampplite/htdocs/tag/geoaki.png.logo

This has the advantage that you can select all tags belonging to that file with glob() for example.

# All tags: every sibling "<name>.<tag>" that shares the original's inode
$original = 'P:/xampplite/htdocs/tag/geoaki.png';
$tags = array();
$files = glob($original . '.*');
foreach ($files as $file) {
    if (fileinode($file) === fileinode($original)) {
        $tags[] = substr($file, strlen($original . '.'));
    }
}

# Check if file has tag foo:
$hasFoo = file_exists($original . '.foo')
    && fileinode($original . '.foo') === fileinode($original);
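Since the plan is a Python port, the same one-link-per-tag scheme might look roughly like this. The function names are hypothetical, and `os.path.samefile` stands in for the inode comparison above.

```python
import glob
import os


def add_tag(path, tag):
    """Create '<path>.<tag>' as a hard link to path, if it does not exist yet."""
    link = f"{path}.{tag}"
    if not os.path.exists(link):
        os.link(path, link)


def tags_of(path):
    """List tags by finding '<path>.*' entries that are the same file as path."""
    prefix = path + "."
    return sorted(
        candidate[len(prefix):]
        for candidate in glob.glob(glob.escape(prefix) + "*")
        if os.path.samefile(candidate, path)
    )


def has_tag(path, tag):
    """True if '<path>.<tag>' exists and points at the same data as path."""
    link = f"{path}.{tag}"
    return os.path.exists(link) and os.path.samefile(link, path)
```

The `samefile` check filters out unrelated files that merely happen to share the name prefix.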

One more thing: relying on MD5 hashes alone to identify files is not safe. You're better off using the file name as the identifier, which is guaranteed to be unique within the folder. The drawbacks of using MD5 as the identifier are:

  • The system breaks as soon as a file is changed, because its hash changes with it
  • MD5 has collisions: two distinct files could have the same MD5 hash (the probability is small, but it exists)

soulmerge answered Oct 05 '22 01:10


You've narrowed the question sufficiently that I believe the answer is: "No."

You don't want a central registry of tags because it could become corrupted.

You don't want a file or files hidden in each directory to hold the data, because that is "garbage".

You probably don't want a parallel set of directories or directories with links, because then it goes out of date when you move stuff and probably constitutes "garbage" on the file system.

You surely don't want to put tags in the contents of the files themselves.

So is there anywhere else you could put tags aside from the file's name in the directory structure?

No. (Or at least there is nothing portable).

Certainly there is nowhere to keep metadata except in the file's name or in the actual file itself that would stay with a file (when it is copied and moved using the usual tools) that would work on all three of the major operating systems you mention (Linux, Mac, Win).

It would be nice if there was a portable metadata system that could do this, but there is not. My impression is that there is no general agreement on what the best way to do tagging is. So each system does it differently and with a different set of trade-offs.

I think that, relative to most of the major ideas in operating systems (hierarchical filesystems, GUIs, etc.), tagging is a fairly new idea. Most of the facilities shared across all three systems are rather old and established ideas.

Your best bet would probably be to study how each system does it and then write a library that would portably provide the lowest common denominator of functionality between systems.

Maybe someone has written a library for Python that does this already?

C.J.

CJ. answered Oct 05 '22 00:10