How to get unique file identifier from a file

Tags:

c#

Before you mark this question as a duplicate, please read what I wrote. I have checked many questions across a lot of pages for a solution but could not find anything. In my current application I was using this:

using (var md5 = MD5.Create())
{
    using (FileStream stream = File.OpenRead(FilePath))
    {
        var hash = md5.ComputeHash(stream);
        var cc = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        Console.WriteLine("Unique ID  : " + cc);
    }
}

This worked well enough for small files, but with large files it took around 30-60 seconds to get the file ID.

I wonder if there is any other way to get something unique from a file, with or without hashing or a stream. My target machine is not always NTFS or Windows, so I have to find another way.

I was wondering whether it makes sense to read just the first "x" bytes from the stream and hash only that shortened stream to get the unique ID.
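A minimal sketch of that "first x bytes" idea, under stated assumptions: `PartialFileId`, `Compute` and the `maxBytes` default are hypothetical names and values, not part of the original code. It hashes the file length together with at most `maxBytes` from the start of the file, so the cost no longer grows with file size. The trade-off is that two files of equal length that differ only after the first `maxBytes` bytes get the same ID.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class PartialFileId
{
    // Hypothetical helper: hashes the file length plus at most maxBytes
    // read from the start of the file.
    public static string Compute(string path, int maxBytes = 1024 * 1024)
    {
        using (var md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] buffer = new byte[maxBytes];
            int total = 0;
            int read;
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }

            // Mix the length in so files that share a prefix but differ
            // in size still get different IDs.
            byte[] length = BitConverter.GetBytes(stream.Length);
            md5.TransformBlock(length, 0, length.Length, null, 0);
            md5.TransformFinalBlock(buffer, 0, total);
            return BitConverter.ToString(md5.Hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Calling `PartialFileId.Compute(FilePath)` returns a 32-character hex string like the full MD5 version, but it is only a heuristic ID: a change deep inside a large file that keeps the length the same would go unnoticed.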

EDIT: This is not for security or anything like that. I need this unique ID because FileSystemWatcher is not working :)

EDIT2: Based on the comments I decided to update my question, because maybe there is a solution that is not based on creating unique IDs for files. My problem is that I have to watch a folder and fire events when there are: A) newly added files, B) changed files, C) deleted files.

The reason I can't use FileSystemWatcher is that it's not reliable. Sometimes I put 100 files into the folder and FileSystemWatcher fires only 20-30 events, and on a network drive it can be even fewer. My method was to save all the files and their unique IDs into a text file and check that index file every 5 seconds for changes. As long as there are no big files, like 18GB, it works fine. But computing the hash of a 40GB file takes far too long. My question is: how can I fire events when something happens to the folder I am watching?
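One way to keep the 5-second polling without reading file contents is to compare cheap metadata between scans. The sketch below is an assumption-laden illustration, not the original method: `FolderSnapshot`, `Take` and `Diff` are hypothetical names, and a file counts as "changed" when its size or last write time differs. That heuristic can miss an edit that preserves both values.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FolderSnapshot
{
    // Hypothetical snapshot: full path -> (size in bytes, last write time in UTC).
    public static Dictionary<string, (long Size, DateTime LastWriteUtc)> Take(string folder)
    {
        var snapshot = new Dictionary<string, (long Size, DateTime LastWriteUtc)>();
        foreach (FileInfo file in new DirectoryInfo(folder).EnumerateFiles("*", SearchOption.AllDirectories))
        {
            snapshot[file.FullName] = (file.Length, file.LastWriteTimeUtc);
        }
        return snapshot;
    }

    // Compares two snapshots using dictionary lookups, so the cost is linear
    // in the number of files and no file content is read.
    public static void Diff(
        Dictionary<string, (long Size, DateTime LastWriteUtc)> previous,
        Dictionary<string, (long Size, DateTime LastWriteUtc)> current,
        out List<string> added, out List<string> changed, out List<string> removed)
    {
        added = new List<string>();
        changed = new List<string>();
        removed = new List<string>();

        foreach (var pair in current)
        {
            if (!previous.TryGetValue(pair.Key, out var before))
                added.Add(pair.Key);          // path did not exist last scan
            else if (!before.Equals(pair.Value))
                changed.Add(pair.Key);        // size or timestamp differs
        }
        foreach (string path in previous.Keys)
        {
            if (!current.ContainsKey(path))
                removed.Add(path);            // path disappeared since last scan
        }
    }
}
```

A timer handler would call `Take` on each tick, `Diff` it against the snapshot kept from the previous tick, raise events for the three lists, and then keep the new snapshot for the next round.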

EDIT3: After setting the bounty I realized I need to give more information about what's going on in my code. First, this is my answer to user @JustShadow (it was too long to post as a comment). I save filepath-uniqueID (MD5 hash) pairs in a text file, and every 5 seconds I check the folder with Directory.GetFiles(DirectoryPath). Then I compare the current list with the list I had 5 seconds ago, and this way I get 2 lists:

List<string> AddedList = FilesInFolder.Where(x => !OldList.Contains(x)).ToList();
List<string> RemovedList = OldList.Where(x => !FilesInFolder.Contains(x)).ToList();

This is how I get them. Now I have my if blocks:

if (AddedList.Count > 0 && RemovedList.Count == 0) then it's simple: no renames, only new files. I hash all the new files and add them to my text file.

if (AddedList.Count == 0 && RemovedList.Count > 0)

This is the opposite of the first if and still simple: there are only removed items, so I remove them from the text file and it's done. After these cases comes my else block, which is where I do the comparing. Basically, I hash all the added and removed list items, then take the ones that exist in both lists. For example, if a.txt is renamed to b.txt, both of my lists will have a count greater than zero, so the else is triggered. Inside the else I already know a.txt's hash (it's in the text file I created 5 seconds ago), so I compare it with all the AddedList elements to see if I can match them. If I get a match, it's a rename; if there is no match, then b.txt really was newly added since the last scan.
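Since the index already stores each file's size, the rename check described above could skip most of the hashing: a renamed file keeps its length, so an added file can only be a rename of a removed file with exactly the same size. The following is a sketch under that assumption; `RenameCandidates` and `NeedHashing` are hypothetical names, and `removedSizes` would come from parsing the stored `FileModel.Size` strings with `long.Parse`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class RenameCandidates
{
    // Returns the added paths whose length equals the size of at least one
    // removed file. Only these can be renames, so only these need to be
    // hashed for the rename comparison.
    public static List<string> NeedHashing(IEnumerable<string> addedPaths, IEnumerable<long> removedSizes)
    {
        var sizes = new HashSet<long>(removedSizes);
        return addedPaths
            .Where(path => sizes.Contains(new FileInfo(path).Length))
            .ToList();
    }
}
```

Added files that are filtered out here can be treated as fresh files straight away. Whether they still get a full hash for the index, and when, is a separate decision.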

I will also share some of my class code; maybe we can find a way to solve this once everyone knows what I'm actually doing. This is what my timer looks like:

private void TestTmr_Elapsed(object sender, System.Timers.ElapsedEventArgs e)
        {

            lock (locker)
            {
                if (string.IsNullOrWhiteSpace(FilePath))
                {
                    Console.WriteLine("Timer will return because FilePath is empty. --> " + FilePath);
                    return;
                }
                try
                {
                    if (!File.Exists(FilePath + @"\index.MyIndexFile"))
                    {
                        Console.WriteLine("File not found. Will be created now.");
                        FileStream close = File.Create(FilePath + @"\index.MyIndexFile");
                        close.Close();
                        return;
                    }

                    string EncryptedText = File.ReadAllText(FilePath + @"\index.MyIndexFile");
                    string JsonString = EncClass.Decrypt(EncryptedText, "SecretPassword");
                    CheckerModel obj = Newtonsoft.Json.JsonConvert.DeserializeObject<CheckerModel>(JsonString);
                    if (obj == null)
                    {
                        CheckerModel check = new CheckerModel();
                        FileInfo FI = new FileInfo(FilePath);
                        check.LastCheckTime = FI.LastAccessTime.ToString();
                        string JsonValue = Newtonsoft.Json.JsonConvert.SerializeObject(check);

                        if (!File.Exists(FilePath + @"\index.MyIndexFile"))
                        {
                            FileStream GG = File.Create(FilePath + @"\index.MyIndexFile");
                            GG.Close();
                        }

                        File.WriteAllText(FilePath + @"\index.MyIndexFile", EncClass.Encrypt(JsonValue, "SecretPassword"));
                        Console.WriteLine("DATA FILLED TO TEXT FILE");
                        obj = Newtonsoft.Json.JsonConvert.DeserializeObject<CheckerModel>(JsonValue);
                    }
                    DateTime LastAccess = Directory.GetLastAccessTime(FilePath);
                    string[] FilesInFolder = Directory.GetFiles(FilePath, "*.*", SearchOption.AllDirectories);
                    List<string> OldList = new List<string>(obj.Files.Select(z => z.Path).ToList());

                    List<string> AddedList = FilesInFolder.Where(x => !OldList.Contains(x)).ToList();
                    List<string> RemovedList = OldList.Where(x => !FilesInFolder.Contains(x)).ToList();


                    if (AddedList.Count == 0 && RemovedList.Count == 0)
                    {
                        //no changes.
                        Console.WriteLine("Nothing changed since last scan..!");
                    }
                    else if (AddedList.Count > 0 && RemovedList.Count == 0)
                    {
                        Console.WriteLine("Adding..");
                        //Files were added but RemovedList is empty, so nothing was renamed. These are freshly added files.
                        List<System.Windows.Forms.ListViewItem> LvItems = new List<System.Windows.Forms.ListViewItem>();
                        for (int i = 0; i < AddedList.Count; i++)
                        {
                            LvItems.Add(new System.Windows.Forms.ListViewItem(AddedList[i] + " has added since last scan.."));
                            FileModel FileItem = new FileModel();
                            using (var md5 = MD5.Create())
                            {
                                using (FileStream stream = File.OpenRead(AddedList[i]))
                                {
                                    FileItem.Size = stream.Length.ToString();
                                    var hash = md5.ComputeHash(stream);
                                    FileItem.Id = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                                }
                            }
                            FileItem.Name = Path.GetFileName(AddedList[i]);
                            FileItem.Path = AddedList[i];
                            obj.Files.Add(FileItem);
                        }
                    }
                    else if (AddedList.Count == 0 && RemovedList.Count > 0)
                    {
                        //Files were removed and none were added, so files were only deleted, not renamed.
                        for (int i = 0; i < RemovedList.Count; i++)
                        {
                            Console.WriteLine(RemovedList[i] + " has been removed from list since last scan..");
                            obj.Files.RemoveAll(x => x.Path == RemovedList[i]);
                        }
                    }
                    else
                    {
                        //Check for rename situations..

                        //Scan newly added files for MD5 ID's. If they are same with old one that means they are renamed.
                        //if a newly added file has a different MD5 ID that is not represented in old ones this file is fresh added.
                        for (int i = 0; i < AddedList.Count; i++)
                        {
                            string NewFileID = string.Empty;
                            string NewFileSize = string.Empty;
                            using (var md5 = MD5.Create())
                            {
                                using (FileStream stream = File.OpenRead(AddedList[i]))
                                {
                                    NewFileSize = stream.Length.ToString();
                                    var hash = md5.ComputeHash(stream);
                                    NewFileID = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                                }
                            }
                            FileModel Result = obj.Files.FirstOrDefault(x => x.Id == NewFileID);
                            if (Result == null)
                            {
                                //Not a rename. It's fresh file.
                                Console.WriteLine(AddedList[i] + " has added since last scan..");
                                //Scan new file and add it to the json list.

                            }
                            else
                            {
                                Console.WriteLine(Result.Path + " has renamed into --> " + AddedList[i]);
                                //if file is replaced then it should be removed from RemovedList
                                RemovedList.RemoveAll(x => x == Result.Path);
                                obj.Files.Remove(Result);
                                //After removing old one add new one. This way new one will look like its renamed
                                FileModel ModelToadd = new FileModel();
                                ModelToadd.Id = NewFileID;
                                ModelToadd.Name = Path.GetFileName(AddedList[i]);
                                ModelToadd.Path = AddedList[i];
                                ModelToadd.Size = NewFileSize;
                                obj.Files.Add(ModelToadd);
                            }

                        }

                        //After handle AddedList we should also inform user for removed files 
                        for (int i = 0; i < RemovedList.Count; i++)
                        {
                            Console.WriteLine(RemovedList[i] + " has deleted since last scan.");
                        }
                    }

                    //Update Json after checking everything.
                    obj.LastCheckTime = LastAccess.ToString();
                    File.WriteAllText(FilePath + @"\index.MyIndexFile", EncClass.Encrypt(Newtonsoft.Json.JsonConvert.SerializeObject(obj), "SecretPassword"));


                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error occurred --> " + ex.Message);
                }
                Console.WriteLine("----------- END OF SCAN ----------");
            }
        }
Shino Lex asked Mar 12 '19 10:03


2 Answers

As to your approach

  1. No checksum, cryptographic or not, can guarantee that collisions are avoided, however unlikely they are.
  2. The more of a file you process, the less likely a collision becomes.
  3. The I/O cost of continually re-reading files is very high.
  4. Windows knows when files are changing, so it's best to use the provided monitoring mechanism.

FileSystemWatcher has an internal buffer; its default size is 8192 bytes (min 4KB, max 64KB). When events are missed, it is typically (in my experience only) because the buffer is too small. Example code follows. In my test I dropped 296 files into an (empty) C:\Temp folder. Every copy resulted in 3 events. None were missed.

using System;
using System.IO;
using System.Threading;

namespace FileSystemWatcherDemo
{
  class Program
  {
    private static int Count = 0;  //  incremented from event threads, so updated via Interlocked
    private static FileSystemWatcher Fsw = new FileSystemWatcher
    {
      InternalBufferSize = 48 * 1024,  //  default 8192 bytes, min 4KB, max 64KB
      EnableRaisingEvents = false
    };
    private static void MonitorFolder(string path)
    {
      Fsw.Path = path;
      //  Attach each handler to its matching event. Attaching all three to Created
      //  would fire ADD, CHG and DEL once for every created file.
      Fsw.Created += FSW_Add;
      Fsw.Changed += FSW_Chg;
      Fsw.Deleted += FSW_Del;
      Fsw.EnableRaisingEvents = true;
    }

    private static void FSW_Add(object sender, FileSystemEventArgs e) { Console.WriteLine($"ADD: {Interlocked.Increment(ref Count)} {e.Name}"); }
    private static void FSW_Chg(object sender, FileSystemEventArgs e) { Console.WriteLine($"CHG: {Interlocked.Increment(ref Count)} {e.Name}"); }
    private static void FSW_Del(object sender, FileSystemEventArgs e) { Console.WriteLine($"DEL: {Interlocked.Increment(ref Count)} {e.Name}"); }
    static void Main(string[] args)
    {
      MonitorFolder(@"C:\Temp\");
      while (true)
      {
        Thread.Sleep(500);
        if (Console.KeyAvailable) break;
      }
      Console.ReadKey();  //  clear buffered keystroke
      Fsw.EnableRaisingEvents = false;
      Console.WriteLine($"{Count} file changes detected");
      Console.ReadKey();
    }
  }
}

Results

ADD: 880 tmpF780.tmp
CHG: 881 tmpF780.tmp
DEL: 882 tmpF780.tmp
ADD: 883 vminst.log
CHG: 884 vminst.log
DEL: 885 vminst.log
ADD: 886 VSIXbpo3w5n5.vsix
CHG: 887 VSIXbpo3w5n5.vsix
DEL: 888 VSIXbpo3w5n5.vsix
888 file changes detected
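Because missed events are tied to the internal buffer, it can help to detect the overflow itself instead of discovering it through missing files. FileSystemWatcher raises its Error event with an InternalBufferOverflowException when the buffer overflows. The sketch below assumes a caller-supplied `rescan` action (hypothetical) that re-reads the folder; `WatcherErrors`, `IsOverflow` and `Attach` are illustrative names only.

```csharp
using System;
using System.IO;

public static class WatcherErrors
{
    // True when the error means the internal buffer overflowed,
    // i.e. some change notifications were dropped.
    public static bool IsOverflow(ErrorEventArgs e)
    {
        return e.GetException() is InternalBufferOverflowException;
    }

    // Subscribes to the watcher's Error event and triggers a full rescan
    // of the folder whenever notifications may have been lost.
    public static void Attach(FileSystemWatcher watcher, Action rescan)
    {
        watcher.Error += (sender, e) =>
        {
            if (IsOverflow(e))
            {
                Console.WriteLine("Watcher buffer overflowed, rescanning folder.");
                rescan();
            }
            else
            {
                Console.WriteLine("Watcher error: " + e.GetException().Message);
            }
        };
    }
}
```

In the demo above this would be wired up with something like `WatcherErrors.Attach(Fsw, () => { /* re-enumerate the folder */ });` before enabling events.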
AlanK answered Oct 25 '22 02:10


You might consider using CRC checksums instead, which are much faster to compute.
Here is how to calculate a CRC-64 checksum in C#:

// Crc64 is assumed to be a HashAlgorithm-derived class from a separate implementation;
// it is not part of the .NET base class library.
Crc64 crc64 = new Crc64();
string hash;

using (FileStream fs = File.OpenRead("c:\\myBigFile.raw"))
{
    hash = BitConverter.ToString(crc64.ComputeHash(fs)).Replace("-", "").ToLowerInvariant();
}

Console.WriteLine("CRC-64 is {0}", hash);

This calculated the checksum of my 4GB file within a few seconds.

Note:
Checksums are more prone to collisions than hashes such as MD5 or SHA.
So, if you have many files, you might consider a hybrid of checksums and hashes. One possible approach is to calculate the checksum first and, only if the checksums match, calculate MD5 to confirm whether the files really are the same.
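The hybrid idea can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: `TwoStageCompare` and `SameContent` are hypothetical names, and `cheapChecksum` is a stand-in delegate for whatever fast checksum is available (for example a CRC-64 implementation). The expensive MD5 pass only runs when the file sizes and the cheap checksums already agree.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class TwoStageCompare
{
    // cheapChecksum is supplied by the caller; it should be fast and may collide.
    public static bool SameContent(string pathA, string pathB, Func<string, ulong> cheapChecksum)
    {
        // Stage 0: different sizes can never be the same content.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        // Stage 1: a cheap checksum mismatch proves the files differ.
        if (cheapChecksum(pathA) != cheapChecksum(pathB))
            return false;

        // Stage 2: checksums matched, so confirm with the slower hash.
        return FullHash(pathA).SequenceEqual(FullHash(pathB));
    }

    private static byte[] FullHash(string path)
    {
        using (var md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }
}
```

When one side of the comparison is a stored index entry instead of a file on disk, the same ordering applies to the stored size, checksum and MD5 values, which avoids re-reading the old file at all.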

P.S. Also check this answer for more info about checksums vs usual hashcodes.

Just Shadow answered Oct 25 '22 01:10