Why can't I leverage 4GB of RAM in my computer to process less than 2GB of information in C#?

Scenario: over 1.5GB of text and CSV files that I need to process mathematically. I tried using SQL Server Express, but loading the information, even with a BULK import, takes a very long time, and ideally I need the entire data set in memory to reduce hard disk IO.

There are over 120,000,000 records, but even when I attempt to filter the information down to just one column (in memory), my C# console application consumes ~3.5GB of memory to process just 125MB (700MB actually read in) of text.

It seems that the references to the strings and string arrays are not being collected by the GC, even after setting all references to null and encapsulating IDisposables with the using keyword.

I think the culprit is the String.Split() method, which creates a new string for each comma-separated value.

You may suggest that I shouldn't even read the unneeded* columns into a string array, but that misses the point: How can I place this entire data set in memory, so I can process it in parallel in C#?

I could optimize the statistical algorithms and coordinate tasks with a sophisticated scheduling algorithm, but this is something I was hoping to do before I ran into memory problems, not because of them.

I have included a full console application that simulates my environment and should help replicate the problem.

Any help is appreciated. Thanks in advance.

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace InMemProcessingLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            //Setup Test Environment. Uncomment Once
            //15000-20000 files would be more realistic
            //InMemoryProcessingLeak.GenerateTestDirectoryFilesAndColumns(3000, 3);
            //GC
            GC.Collect();
            //Demonstrate Large Object Memory Allocation Problem (LOMAP)
            InMemoryProcessingLeak.SelectColumnFromAllFiles(3000, 2);
        }
    }

    class InMemoryProcessingLeak
    {
        public static List<string> SelectColumnFromAllFiles(int filesToSelect, int column)
        {
            List<string> allItems = new List<string>();
            int fileCount = filesToSelect;
            long fileSize, totalReadSize = 0;

            for (int i = 1; i <= fileCount; i++)
            {
                allItems.AddRange(SelectColumn(i, column, out fileSize));
                totalReadSize += fileSize;
                Console.Clear();
                Console.Out.WriteLine("Reading file {0:00000} of {1}", i, fileCount);
                Console.Out.WriteLine("Memory = {0}MB", GC.GetTotalMemory(false) / 1048576);
                Console.Out.WriteLine("Total Read = {0}MB", totalReadSize / 1048576);
            }
            Console.ReadLine();
            return allItems;

        }

        //reads a csv file and returns the values for a selected column
        private static List<string> SelectColumn(int fileNumber, int column, out long fileSize)
        {
            string fileIn;
            FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
            fileSize = file.Length;
            using (System.IO.FileStream fs = file.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                using (System.IO.StreamReader sr = new System.IO.StreamReader(fs))
                {
                    fileIn = sr.ReadToEnd();
                }
            }

            string[] lineDelimiter = { "\n" };
            string[] allLines = fileIn.Split(lineDelimiter, StringSplitOptions.None);

            List<string> processedColumn = new List<string>();

            string current;
            for (int i = 0; i < allLines.Length - 1; i++)
            {
                current = GetColumnFromProcessedRow(allLines[i], column);
                processedColumn.Add(current);
            }

            for (int i = 0; i < lineDelimiter.Length; i++) //GC
            {
                lineDelimiter[i] = null;
            }
            lineDelimiter = null;

            for (int i = 0; i < allLines.Length; i++) //GC
            {
                allLines[i] = null;
            }
            allLines = null;
            current = null;

            return processedColumn;
        }

        //returns a row value from the selected comma separated string and column position
        private static string GetColumnFromProcessedRow(string line, int columnPosition)
        {
            string[] entireRow = line.Split(",".ToCharArray());
            string currentColumn = entireRow[columnPosition];
            //GC
            for (int i = 0; i < entireRow.Length; i++)
            {
                entireRow[i] = null;
            }
            entireRow = null;
            return currentColumn;
        }

        #region Generators
        public static void GenerateTestDirectoryFilesAndColumns(int filesToGenerate, int columnsToGenerate)
        {
            DirectoryInfo dirInfo = new DirectoryInfo("MemLeakTestFiles");
            if (!dirInfo.Exists)
            {
                dirInfo.Create();
            }
            Random seed = new Random();

            string[] columns = new string[columnsToGenerate];

            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= filesToGenerate; i++)
            {
                int rows = seed.Next(10, 8000);
                for (int j = 0; j < rows; j++)
                {
                    sb.Append(GenerateRow(seed, columnsToGenerate));
                }
                using (TextWriter tw = new StreamWriter(String.Format(@"{0}/File{1:00000}.txt", dirInfo, i)))
                {
                    tw.Write(sb.ToString());
                    tw.Flush();
                }
                sb.Remove(0, sb.Length);
                Console.Clear();
                Console.Out.WriteLine("Generating file {0:00000} of {1}", i, filesToGenerate);
            }
        }

        private static string GenerateString(Random seed)
        {
            StringBuilder sb = new StringBuilder();
            int characters = seed.Next(4, 12);
            for (int i = 0; i < characters; i++)
            {
                sb.Append(Convert.ToChar(Convert.ToInt32(Math.Floor(26 * seed.NextDouble() + 65))));
            }
            return sb.ToString();
        }

        private static string GenerateRow(Random seed, int columnsToGenerate)
        {
            StringBuilder sb = new StringBuilder();

            sb.Append(seed.Next());
            for (int i = 0; i < columnsToGenerate - 1; i++)
            {
                sb.Append(",");
                sb.Append(GenerateString(seed));
            }
            sb.Append("\n");

            return sb.ToString();
        }
        #endregion
    }
}

*These other columns will be needed and accessed both sequentially and randomly through the life of the program, so reading from disk each time is a tremendously taxing overhead.

**Environment Notes: 4GB of DDR2 SDRAM 800, Core 2 Duo 2.5GHz, .NET Runtime 3.5 SP1, Vista 64.

asked Apr 01 '09 by exceptionerror


3 Answers

Yes, String.Split creates a new String object for each "piece" - that's what it's meant to do.

Now, bear in mind that strings in .NET are Unicode (UTF-16, really), and with the object overhead the cost of a string is approximately 20 + 2*n bytes, where n is the number of characters.

That means if you've got a lot of small strings, it'll take a lot of memory compared with the size of text data involved. For example, an 80 character line split into 10 x 8 character strings will take 80 bytes in the file, but 10 * (20 + 2*8) = 360 bytes in memory - a 4.5x blow-up!
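
As a rough sketch, the heap cost of splitting a line can be estimated from that approximation (the per-string constant follows the 20 + 2*n figure above; the array cost is a ballpark assumption for a 64-bit runtime, not an exact CLR number):

//Sketch: ballpark heap cost of splitting one comma-separated line,
//using the ~20 + 2*n bytes-per-string approximation above.
static long EstimateSplitCost(string line)
{
    string[] pieces = line.Split(',');
    long total = 0;
    foreach (string piece in pieces)
    {
        total += 20 + 2L * piece.Length;  //per-string overhead plus UTF-16 characters
    }
    total += 16 + 8L * pieces.Length;     //rough cost of the string[] itself (64-bit references)
    return total;
}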

I doubt that this is a GC problem - and I'd advise you to remove the extra statements setting variables to null, since they aren't necessary - it's just a problem of having too much data.

What I would suggest is that you read the file line by line (using TextReader.ReadLine() instead of TextReader.ReadToEnd()). Having the whole file in memory when you don't need it is clearly wasteful.
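
That approach might look roughly like this when applied to the SelectColumn method from the question (a sketch only; the file layout and column index are taken from the question's code):

        //Sketch: line-by-line version of SelectColumn, assuming the same
        //MemLeakTestFiles layout and zero-based column index as the question.
        private static List<string> SelectColumnStreaming(int fileNumber, int column, out long fileSize)
        {
            FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
            fileSize = file.Length;
            List<string> processedColumn = new List<string>();
            using (StreamReader sr = new StreamReader(file.FullName))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    if (line.Length == 0) continue;        //skip the trailing blank line
                    string[] fields = line.Split(',');     //still allocates, but only one row at a time
                    processedColumn.Add(fields[column]);
                }
            }
            return processedColumn;
        }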

answered by Jon Skeet


I would suggest reading line by line instead of the entire file, or reading in blocks of up to 1-2MB.

Update:
From Jon's comments I was curious and experimented with 4 methods:

  • StreamReader.ReadLine (default and custom buffer size)
  • StreamReader.ReadToEnd
  • My block-reading method suggested above

Reading a 180MB log file:

  • ReadLine ms: 1937
  • ReadLine bigger buffer, ascii ms: 1926
  • ReadToEnd ms: 2151
  • Custom ms: 1415

The custom StreamReader was:

StreamReader streamReader = new StreamReader(fileStream, Encoding.Default, false, 16384);

StreamReader's default buffer size is 1024.

For memory consumption (the actual question!): ~800MB was used. And the method I suggested still uses a StringBuilder (which is backed by a string), so it is no better on memory consumption.
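
For context, a comparison like this can be timed with a Stopwatch harness along the following lines (a sketch only; the path is a placeholder and this is not the harness behind the numbers above):

//Sketch: timing ReadLine vs ReadToEnd with System.Diagnostics.Stopwatch.
string path = @"C:\temp\sample.log"; //placeholder path

System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
using (StreamReader sr = new StreamReader(path, Encoding.Default, false, 16384))
{
    string line;
    while ((line = sr.ReadLine()) != null) { } //read and discard each line
}
sw.Stop();
Console.WriteLine("ReadLine ms: {0}", sw.ElapsedMilliseconds);

sw = System.Diagnostics.Stopwatch.StartNew();
using (StreamReader sr = new StreamReader(path))
{
    string everything = sr.ReadToEnd(); //read the whole file in one go
}
sw.Stop();
Console.WriteLine("ReadToEnd ms: {0}", sw.ElapsedMilliseconds);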

answered by Chris S


Modern GC languages take advantage of large amounts of cheap RAM to offload memory management tasks. This imposes a certain overhead, but your typical business app doesn't really need that much information anyway. Many programs get by with fewer than a thousand objects. Manually managing that many is a chore, but even a thousand bytes of per-object overhead wouldn't matter.

In your case, the per-object overhead is becoming a problem. You could, for instance, consider representing each column as one object, implemented with a single String and an array of integer offsets. To return a single field, you return a substring (possibly via a shim).
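
A minimal sketch of that idea, assuming the column values are already available as strings (the PackedColumn name and API are illustrative, not a drop-in replacement for the code above):

//Sketch: one column stored as a single backing string plus integer offsets.
//Field i lives at [offsets[i], offsets[i + 1]) in the backing string.
class PackedColumn
{
    private readonly string data;
    private readonly int[] offsets; //length = field count + 1

    public PackedColumn(IList<string> fields)
    {
        StringBuilder sb = new StringBuilder();
        offsets = new int[fields.Count + 1];
        for (int i = 0; i < fields.Count; i++)
        {
            offsets[i] = sb.Length;
            sb.Append(fields[i]);
        }
        offsets[fields.Count] = sb.Length;
        data = sb.ToString();
    }

    public int Count { get { return offsets.Length - 1; } }

    //Materializes a field string only when it is actually requested.
    public string GetField(int i)
    {
        return data.Substring(offsets[i], offsets[i + 1] - offsets[i]);
    }
}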

answered by MSalters