
Read a very large file by chunks and not line-by-line

Tags: c#, file

I want to read a CSV file which can be hundreds of GB or even a TB in size. However, I am limited to reading the file in chunks of 32MB. My solution to the problem not only works rather slowly, it can also break a line in the middle.

I wanted to ask if you know of a better solution:

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    bool stop = false;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32mb chunks at a time
    {
        var stream = new StreamReader(new MemoryStream(buffer));
        while ((line = stream.ReadLine()) != null)
        {
            //process line
        }

    }
}

Please do not respond with a solution that reads the file line by line (for example, File.ReadLines is NOT an acceptable solution). Why? Because I'm just searching for another solution...

asked Jan 15 '14 by Yonatan Nir



2 Answers

The problem with your solution is that you recreate the streams in each iteration. Try this version:

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        memoryStream.Seek(0, SeekOrigin.Begin);
        memoryStream.SetLength(bytesRead); //only the bytes just read are valid
        stream.DiscardBufferedData();      //drop the reader's stale internal buffer

        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);

            if (line != null)
            {
                //process line
            }
        }
    }
}

private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        if (charBuffer[0] == '\n')
        {
            string result = currentLine.ToString();
            currentLine.Clear();

            if (result.EndsWith("\r")) //remove if newlines are two characters ("\r\n")
            {
                result = result.Substring(0, result.Length - 1);
            }

            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }

    return null; //line not complete yet, it continues in the next chunk
}

private char[] charBuffer = new char[1];

NOTE: This needs some tweaking if you need the newline characters to be contained in the result. The worst case would be the newline pair "\r\n" being split across two blocks. However, since you were using ReadLine, I assumed that you don't need this.
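
For illustration, that tweak could look something like this (a hypothetical variant of the method above, reusing the same charBuffer field, that returns the terminator as part of the line):

private string ReadLineKeepingNewline(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        currentLine.Append(charBuffer[0]);

        if (charBuffer[0] == '\n')
        {
            //a '\r' that ended the previous block is already in currentLine,
            //so a "\r\n" pair split across two blocks is returned intact
            string result = currentLine.ToString();
            currentLine.Clear();
            return result;
        }
    }

    return null; //line not complete yet
}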

Also note that if your whole file consists of a single line, this will still end up accumulating all of the data in memory anyway.

answered Oct 23 '22 by BartoszKP


which can be hundreds of GB or even a TB in size

For processing a file this large, the most suitable class is the MemoryMappedFile class.

Some advantages:

  • It is ideal for accessing a data file on disk without performing explicit file I/O operations and without buffering the file's contents, which works well when you deal with very large data files.

  • You can use memory-mapped files to allow multiple processes running on the same machine to share data with each other.

So try it and you will notice the difference, since moving data between memory and the hard disk is a time-consuming operation.
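
To give an idea of how this could fit the question's constraints, here is a minimal sketch (not a drop-in solution: the file path and ProcessLine are hypothetical placeholders, and it assumes a single-byte encoding such as ASCII, since a multi-byte character could be split at a view boundary):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MmfChunkReader
{
    const long CHUNK = 33554432; //32MB, the question's limit

    static void Main()
    {
        string filePath = "huge.csv"; //hypothetical path
        long fileLength = new FileInfo(filePath).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(
            filePath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            var carry = new StringBuilder(); //partial line carried over between views

            for (long offset = 0; offset < fileLength; offset += CHUNK)
            {
                long size = Math.Min(CHUNK, fileLength - offset);

                //map only the current 32MB window; the OS pages it in on demand
                using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
                using (var reader = new StreamReader(view))
                {
                    int c;
                    while ((c = reader.Read()) != -1)
                    {
                        if (c == '\n')
                        {
                            ProcessLine(carry.ToString().TrimEnd('\r'));
                            carry.Clear();
                        }
                        else
                        {
                            carry.Append((char)c);
                        }
                    }
                }
            }

            if (carry.Length > 0) //file may not end with a newline
            {
                ProcessLine(carry.ToString());
            }
        }
    }

    static void ProcessLine(string line)
    {
        //placeholder for whatever per-line work the caller needs
    }
}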

answered Oct 23 '22 by BRAHIM Kamel