
Is a file readable (contains text, rather than merely being accessible)?

I am working on a project that reads all files from the local HDD; I specify the extensions I would like to include in the search.

All chosen file extensions are based on the fact that the file has text content.

So for my own use I can specify which extensions to take into account, such as .cs, .html, .htm, .css, .js, etc.

What if I want to add a feature that lets a generic user select extensions from all available Windows file extensions, but includes in that list only those file types on his system that are textual? For instance, we know that .exe, .mp3, .mpg and .avi are not, but he could have other file types (extensions) that we did not take into account.

Is there a way to decide that based on a system file property? If not, what would be the way to filter only text-content files?

Jbob Johan asked Nov 14 '15 18:11



1 Answer

One mechanism for Windows machines is to look up the Content Type in the Windows Registry associated with the file extension. (I do not know of a way to do this without a direct registry lookup.)

Within the registry, file extensions that are text-based should generally have one or more of these characteristics:

  • A Content Type value whose MIME primary type is text, e.g., text/plain or text/html
  • A PerceivedType value of text
  • A PersistentHandler subkey whose default value is the GUID {5e941d80-bf96-11cd-b579-08002b30bfeb}, which is assigned to the plain text persistent handler.

The following method will return all system extensions associated with these characteristics:

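// Requires: using System; using System.Collections.Generic; using System.Linq; using Microsoft.Win32;
static IEnumerable<string> GetTextExtensions()
{
    // Registry key names and values are not linguistic text, so compare ordinally.
    var defaultcomp = StringComparison.OrdinalIgnoreCase;
    var root = Registry.ClassesRoot;
    // Extension keys under HKEY_CLASSES_ROOT start with a dot, e.g. ".txt".
    foreach (var s in root.GetSubKeyNames()
        .Where(a => a.StartsWith(".", StringComparison.Ordinal)))
    {
        using (RegistryKey subkey = root.OpenSubKey(s))
        {
            // OpenSubKey returns null if the key no longer exists.
            if (subkey == null)
                continue;

            // 1. MIME content type with primary type "text"
            if (subkey.GetValue("Content Type")?.ToString().StartsWith("text/", defaultcomp) == true)
                yield return s;
            // 2. Perceived type of "text"
            else if (subkey.GetValue("PerceivedType")?.ToString().Equals("text", defaultcomp) == true)
                yield return s;
            else
            {
                // 3. Plain text persistent handler
                using (var ph = subkey.OpenSubKey("PersistentHandler"))
                {
                    if (ph?.GetValue("")?.ToString().Equals("{5e941d80-bf96-11cd-b579-08002b30bfeb}", defaultcomp) == true)
                        yield return s;
                }
            }
        }
    }
}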

The output depends on the workstation configuration, but on my current machine it returns:

.a, .AddIn, .ans, .asc, .asm, .asmx, .aspx, .asx, .bas, .bat, .bcp, .c, .cc, .cd, .cls, .cmd, ...
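To tie this back to the question, here is a rough sketch of how such a list of extensions might be used to filter files on disk. The class name TextFileFilter and the method FindTextFiles are made up for illustration (they are not part of the answer above), and the sketch assumes the extensions include the leading dot, as the registry key names do:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class TextFileFilter
{
    // Hypothetical helper: returns the files under `root` whose extension
    // appears in `textExtensions` (e.g. the result of GetTextExtensions()).
    // Note: EnumerateFiles can throw UnauthorizedAccessException on folders
    // the current user cannot read; real code would need to handle that.
    public static IEnumerable<string> FindTextFiles(string root, IEnumerable<string> textExtensions)
    {
        // Extensions are case-insensitive on Windows, so use an ordinal,
        // case-insensitive set for the lookup.
        var allowed = new HashSet<string>(textExtensions, StringComparer.OrdinalIgnoreCase);

        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the file has no extension.
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(path => allowed.Contains(Path.GetExtension(path)));
    }
}
```

It could then be called as TextFileFilter.FindTextFiles(someFolder, GetTextExtensions()), where someFolder is whatever directory the user picked.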

While this registry-based approach depends on application installers correctly mapping file extensions, it appears to identify most of the major text file types.
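For extensions that no installer has registered, the question also asks how to filter by content. One common heuristic, which is my own assumption here and not something the registry provides, is to sample the start of the file and treat it as binary if it contains a NUL byte. The names TextSniffer and LooksLikeText and the 4096-byte sample size are arbitrary choices for this sketch:

```csharp
using System;
using System.IO;

static class TextSniffer
{
    // Heuristic only: a file "looks like text" if its first `sampleSize`
    // bytes contain no NUL (0x00) byte. UTF-16 encoded text contains NUL
    // bytes and would be misclassified as binary by this check.
    public static bool LooksLikeText(string path, int sampleSize = 4096)
    {
        var buffer = new byte[sampleSize];
        using (var stream = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested; only scan what was read.
            int read = stream.Read(buffer, 0, buffer.Length);
            return Array.IndexOf(buffer, (byte)0, 0, read) < 0;
        }
    }
}
```

This is cheap because it reads only a small prefix of each file, but it can give false positives and false negatives, so it is probably best used as a fallback next to the registry lookup rather than as a replacement for it.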

drf answered Oct 20 '22 04:10
