Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Processing Huge Files In C#

I have a 4Gb file that I want to perform a byte based find and replace on. I have written a simple program to do it but it takes far too long (90 minutes+) to do just one find and replace. A few hex editors I have tried can perform the task in under 3 minutes and don't load the entire target file into memory. Does anyone know a method where I can accomplish the same thing? Here is my current code:

    public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
    {
        var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite);
        int FindPoint = 0;
        int Results = 0;
        for (long i = 0; i < Stream.Length; i++)
        {
            if (Find[FindPoint] == Stream.ReadByte())
            {
                FindPoint++;
                if (FindPoint > Find.Length - 1)
                {
                    Results++;
                    FindPoint = 0;
                    Stream.Seek(-Find.Length, SeekOrigin.Current);
                    Stream.Write(Replace, 0, Replace.Length);
                }
            }
            else
            {
                FindPoint = 0;
            }
        }
        Stream.Close();
        return Results;
    }

Find and Replace are relatively small compared with the 4Gb "File" by the way. I can easily see why my algorithm is slow but I am not sure how I could do it better.

like image 731
cgimusic Avatar asked Apr 30 '12 17:04

cgimusic


2 Answers

Part of the problem may be that you're reading the stream one byte at a time. Try reading larger chunks and doing a replace on those. I'd start with about 8kb and then test with some larger or smaller chunks to see what gives you the best performance.

like image 131
Spencer Ruport Avatar answered Sep 29 '22 13:09

Spencer Ruport


There are lots of better algorithms for finding a substring in a string (which is basically what you are doing)

Start here:

http://en.wikipedia.org/wiki/String_searching_algorithm

The gist of them is that you can skip a lot of bytes by analyzing your substring. Here's a simple example

4GB File starts with: A B C D E F G H I J K L M N O P

Your substring is: N O P

  1. You skip the length of the substring-1 and check against the last byte, so compare C to P
  2. It doesn't match, so the substring is not the first 3 bytes
  3. Also, C isn't in the substring at all, so you can skip 3 more bytes (len of substring)
  4. Compare F to P, doesn't match, F isn't in substring, skip 3
  5. Compare I to P, etc, etc

If you match, go backwards. If the character doesn't match, but is in the substring, then you have to do some more comparing at that point (read the link for details)

like image 41
Lou Franco Avatar answered Sep 29 '22 11:09

Lou Franco