Processing Huge Files In C#

Question

I have a 4 GB file that I want to perform a byte-based find and replace on. I have written a simple program to do it, but it takes far too long (90+ minutes) to do just one find and replace. A few hex editors I have tried can perform the task in under 3 minutes without loading the entire target file into memory. Does anyone know a method by which I can accomplish the same thing? Here is my current code:

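    // Replaces occurrences of Find in the file with Replace, in place.
    // Returns the number of replacements made.
    public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
    {
        int Results = 0;
        // "using" closes the stream even if an exception is thrown.
        using (var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite))
        {
            int FindPoint = 0; // index of the next byte of Find to match
            for (long i = 0; i < Stream.Length; i++)
            {
                // The file is read one byte at a time.
                if (Find[FindPoint] == Stream.ReadByte())
                {
                    FindPoint++;
                    if (FindPoint > Find.Length - 1)
                    {
                        // Full match: step back and overwrite it.
                        // This is only correct when Replace.Length == Find.Length.
                        Results++;
                        FindPoint = 0;
                        Stream.Seek(-Find.Length, SeekOrigin.Current);
                        Stream.Write(Replace, 0, Replace.Length);
                    }
                }
                else
                {
                    // A failed partial match restarts at the next byte, so some
                    // occurrences can be missed (e.g. "AAB" inside "AAAB").
                    FindPoint = 0;
                }
            }
        }
        return Results;
    }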

Find and Replace are relatively small compared with the 4 GB "File", by the way. I can easily see why my algorithm is slow, but I am not sure how to do it better.

Spencer Ruport · Accepted Answer

Part of the problem may be that you're reading the stream one byte at a time. Try reading larger chunks and doing the replace on those. I'd start with about 8 KB and then test with larger and smaller chunks to see what gives you the best performance.
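A rough sketch of what chunked reading could look like is below. It is not a tuned implementation and rests on some assumptions that are not in the question: the class name `ChunkedReplace`, the `chunkSize` parameter, and the `MatchesAt` helper are made up for illustration, and `replace` is required to be the same length as `find` so that nothing in the file has to shift. The last `find.Length - 1` unscanned bytes of each chunk are carried over to the next one so that a match straddling a chunk boundary is still found.

```csharp
using System;
using System.IO;

public static class ChunkedReplace
{
    // Replaces non-overlapping occurrences of find with replace, in place.
    // Assumption: replace.Length == find.Length, so no bytes need to move.
    public static int ReplaceBytes(string path, byte[] find, byte[] replace,
                                   int chunkSize = 8 * 1024)
    {
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find must be non-empty and as long as replace");

        int overlap = find.Length - 1;               // most bytes ever carried over
        byte[] buffer = new byte[chunkSize + overlap];
        int results = 0;

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            long bufferStart = 0; // file offset of buffer[0]
            int carried = 0;      // unscanned bytes kept from the previous chunk
            int read;
            while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
            {
                int valid = carried + read;
                long resumeAt = bufferStart + valid; // where the next Read continues
                int i = 0;
                while (i + find.Length <= valid)
                {
                    if (MatchesAt(buffer, i, find))
                    {
                        // Overwrite the match in the file, then restore the read position.
                        stream.Position = bufferStart + i;
                        stream.Write(replace, 0, replace.Length);
                        stream.Position = resumeAt;
                        results++;
                        i += find.Length;
                    }
                    else
                    {
                        i++;
                    }
                }
                // Keep the tail that is too short to scan yet and prepend it
                // to the next chunk.
                carried = valid - i;
                Array.Copy(buffer, i, buffer, 0, carried);
                bufferStart += i;
            }
        }
        return results;
    }

    // True if pattern occurs in buffer starting at offset.
    private static bool MatchesAt(byte[] buffer, int offset, byte[] pattern)
    {
        for (int k = 0; k < pattern.Length; k++)
            if (buffer[offset + k] != pattern[k]) return false;
        return true;
    }
}
```

A call such as `ChunkedReplace.ReplaceBytes(path, find, replace, 64 * 1024)` would let you try different chunk sizes. The inner scan is still a naive byte-by-byte comparison; the gain here comes only from reading the file in larger blocks.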

Lou Franco · Answer

There are lots of better algorithms for finding a substring in a string (which is basically what you are doing).

Start here:

http://en.wikipedia.org/wiki/String_searching_algorithm

The gist of them is that you can skip a lot of bytes by analyzing your substring. Here's a simple example:

The 4 GB file starts with: A B C D E F G H I J K L M N O P

Your substring is: N O P

You skip ahead by the length of the substring minus 1 and check against its last byte, so you compare C to P.
It doesn't match, so the substring is not the first 3 bytes.
Also, C isn't in the substring at all, so you can skip 3 more bytes (the length of the substring).
Compare F to P: it doesn't match and F isn't in the substring, so skip 3.
Compare I to P, and so on.

If the last byte matches, compare backwards through the rest of the substring. If the byte doesn't match but does occur in the substring, you have to do some more comparing at that point (read the link for details).
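The skipping described above resembles the Boyer-Moore-Horspool variant of these algorithms, so here is a minimal sketch of that variant over an in-memory byte array. The class name `HorspoolSearch` and the method `FindAll` are hypothetical names chosen for this example, and this is one possible reading of the idea rather than the specific algorithm the linked article recommends. For each byte value, a shift table records how far the search window may move when that byte sits under the last position of the substring.

```csharp
using System;
using System.Collections.Generic;

public static class HorspoolSearch
{
    // Returns the start index of every occurrence of needle in haystack
    // (overlapping occurrences included).
    public static List<int> FindAll(byte[] haystack, byte[] needle)
    {
        var matches = new List<int>();
        if (needle.Length == 0 || haystack.Length < needle.Length)
            return matches;

        int last = needle.Length - 1;

        // Bytes that do not occur in the needle (ignoring its last position)
        // allow a shift by the full needle length.
        int[] shift = new int[256];
        for (int b = 0; b < 256; b++) shift[b] = needle.Length;
        // Bytes that do occur shift so their rightmost occurrence lines up.
        for (int k = 0; k < last; k++) shift[needle[k]] = last - k;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare backwards, starting from the last byte of the window.
            int j = last;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) matches.Add(pos);

            // Shift according to the byte under the window's last position.
            pos += shift[haystack[pos + last]];
        }
        return matches;
    }
}
```

With the example above (file contents A through P, substring N O P), the window jumps 3 bytes at a time past C, F, I and L, moves by 1 when it sees O, and then finds the match. For a 4 GB file you would run a search like this over one chunk at a time rather than over the whole file in memory.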

Tags: c#, replace, large-files, byte

cgimusic
