I want to read a CSV file that can be hundreds of GBs or even a TB in size. I have a limitation that I can only read the file in 32MB chunks. My solution to the problem not only works rather slowly, it can also break a line in the middle.
I wanted to ask if you know of a better solution:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    bool stop = false;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32mb chunks at a time
    {
        var stream = new StreamReader(new MemoryStream(buffer));
        while ((line = stream.ReadLine()) != null)
        {
            //process line
        }
    }
}
Please do not respond with a solution which reads the file line by line (for example, File.ReadLines is NOT an acceptable solution). Why? Because I'm just searching for another solution...
The problem with your solution is that you recreate the streams in each iteration. Try this version:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        memoryStream.SetLength(bytesRead);    // the last chunk may be shorter than the buffer
        memoryStream.Seek(0, SeekOrigin.Begin);
        stream.DiscardBufferedData();         // drop the reader's internal cache so it sees the new chunk
        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);
            if (line != null)
            {
                //process line
            }
        }
    }
    // note: if the file does not end with a newline, currentLine still holds the last line here
}
private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        if (charBuffer[0] == '\n')
        {
            string result = currentLine.ToString();
            currentLine.Clear();
            if (result.EndsWith("\r")) // strips the '\r' of a "\r\n" pair; drop this if your newlines are a single character
            {
                result = result.Substring(0, result.Length - 1);
            }
            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }
    return null; //line not complete yet
}

private char[] charBuffer = new char[1];
NOTE: This needs some tweaking if newlines are two characters long and you need the newline characters to be contained in the result. The worst case is the "\r\n" pair being split across two blocks. However, since you were using ReadLine, I assumed that you don't need this.
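If you do need the terminators kept, a minimal sketch of that tweak could look like this (the method name is mine, it reuses the same charBuffer field, and because currentLine persists between calls, a "\r\n" pair split across two blocks is simply completed on the next call):

private string ReadLineKeepingNewline(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        currentLine.Append(charBuffer[0]);
        if (charBuffer[0] == '\n') // both "\n" and "\r\n" terminate here
        {
            string result = currentLine.ToString();
            currentLine.Clear();
            return result; // terminator(s) included in the result
        }
    }
    return null; // line not complete yet
}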
Also note that if your whole data contains only a single line, this approach will still end up accumulating the entire file in memory anyway.
"can be hundreds of GBs or even a TB in size"
For processing a file this large, the most suitable class is the MemoryMappedFile class.
Some advantages:
It is ideal for accessing a data file on disk without performing explicit file I/O operations and without buffering the file's entire content. This works great when you deal with large data files.
You can use memory-mapped files to allow multiple processes running on the same machine to share data with each other.
So try it and you will notice the difference, since swapping between memory and the hard disk is a time-consuming operation.
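To make that concrete, here is a minimal sketch (my own names and structure, not a reference implementation) of processing the file through fixed 32MB view streams, assuming UTF-8 text and that no multi-byte character is split exactly at a view boundary; a partial line at the end of one view is carried over into the next:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static class MmfChunkReader
{
    const int VIEW_SIZE = 33554432; // 32MB per view

    public static void Process(string filePath, Action<string> processLine)
    {
        long fileLength = new FileInfo(filePath).Length;
        var carry = new StringBuilder();          // partial line left over from the previous view
        byte[] viewBuffer = new byte[VIEW_SIZE];

        using (var mmf = MemoryMappedFile.CreateFromFile(
                   filePath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < fileLength; offset += VIEW_SIZE)
            {
                int size = (int)Math.Min(VIEW_SIZE, fileLength - offset);
                using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
                {
                    int read = 0;                 // fill the buffer from the mapped view
                    while (read < size)
                    {
                        int n = view.Read(viewBuffer, read, size - read);
                        if (n == 0) break;
                        read += n;
                    }

                    // assumes no multi-byte UTF-8 character is cut at the view boundary
                    string text = carry.ToString() + Encoding.UTF8.GetString(viewBuffer, 0, read);
                    carry.Clear();

                    int start = 0, nl;
                    while ((nl = text.IndexOf('\n', start)) >= 0)
                    {
                        processLine(text.Substring(start, nl - start).TrimEnd('\r'));
                        start = nl + 1;
                    }
                    carry.Append(text, start, text.Length - start); // keep the incomplete tail
                }
            }
        }

        if (carry.Length > 0)
            processLine(carry.ToString());        // final line without a trailing newline
    }
}

This keeps only one 32MB view plus the carried-over line in memory at a time, which satisfies the 32MB constraint from the question.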