
Parsing a huge text file (around 2GB) with custom delimiters


I have a huge text file, around 2GB, that I am trying to parse in C#. The file has custom delimiters for rows and columns. I want to parse the file, extract the data, and write it to another file, inserting a column header and replacing the RowDelimiter with a newline and the ColumnDelimiter with a tab, so that I get the data in tabular format.

sample data:
1'~'2'~'3#####11'~'12'~'13

RowDelimiter: #####
ColumnDelimiter: '~'
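
So for the sample above, after inserting the header row the output should look like:

1	2	3
11	12	13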

I keep getting a System.OutOfMemoryException on the following line:

while ((line = rdr.ReadLine()) != null)

public void ParseFile(string inputfile,string outputfile,string header)
{

    using (StreamReader rdr = new StreamReader(inputfile))
    {
        string line;

        while ((line = rdr.ReadLine()) != null)
        {
            using (StreamWriter sw = new StreamWriter(outputfile))
            {
                //Write the Header row
                sw.Write(header);

                //parse the file
                string[] rows = line.Split(new string[] { ParserConstants.RowSeparator },
                    StringSplitOptions.None);

                foreach (string row in rows)
                {
                    string[] columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
                        StringSplitOptions.None);
                    foreach (string column in columns)
                    {
                        sw.Write(column + "\t");
                    }
                    sw.Write(ParserConstants.NewlineCharacter);
                    Console.WriteLine();
                }
            }

            Console.WriteLine("File Parsing completed");

        }
    }
}
Asked by Lizzy


2 Answers

Read the data into a buffer and then do your parsing, carrying any incomplete row over to the next read.

using (StreamReader rdr = new StreamReader(inputfile))
using (StreamWriter sw = new StreamWriter(outputfile))
{
    char[] buffer = new char[256];
    int read;

    //Write the Header row
    sw.Write(header);

    string remainder = string.Empty;
    while ((read = rdr.Read(buffer, 0, buffer.Length)) > 0)
    {
        //Prepend the leftover from the previous chunk so a row
        //delimiter that straddles a chunk boundary is still found
        string bufferData = remainder + new string(buffer, 0, read);

        //parse the chunk
        string[] rows = bufferData.Split(
            new string[] { ParserConstants.RowSeparator },
            StringSplitOptions.None);

        //the last piece may be an incomplete row; carry it to the next read
        int completeRows = rows.Length - 1;
        remainder = rows[completeRows];

        foreach (string row in rows.Take(completeRows)) //Take requires using System.Linq;
        {
            string[] columns = row.Split(
                new string[] { ParserConstants.ColumnSeparator },
                StringSplitOptions.None);
            foreach (string column in columns)
            {
                sw.Write(column + "\t");
            }
            sw.Write(ParserConstants.NewlineCharacter);
        }
    }

    //whatever is left after the final read is the last row
    if (remainder.Length > 0)
    {
        string[] columns = remainder.Split(
            new string[] { ParserConstants.ColumnSeparator },
            StringSplitOptions.None);
        foreach (string column in columns)
        {
            sw.Write(column + "\t");
        }
        sw.Write(ParserConstants.NewlineCharacter);
    }

    Console.WriteLine("File Parsing completed");
}
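
The snippets here assume a ParserConstants class that the question doesn't show; a minimal sketch with the delimiters from the question might look like this (the NewlineCharacter value is an assumption):

public static class ParserConstants
{
    public const string RowSeparator = "#####";   // row delimiter from the question
    public const string ColumnSeparator = "'~'";  // column delimiter from the question
    public const string NewlineCharacter = "\n";  // assumed; Environment.NewLine also works
}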
Answered by juharr


As mentioned already in the comments, you won't be able to use ReadLine to handle this: the file has no newline characters, so ReadLine ends up trying to buffer the entire 2GB as a single line. You'll have to essentially process the data one character at a time. The good news is that this is basically how ReadLine works anyway, so we're not losing a lot in this case.

Using a StreamReader we can read a series of characters from the source stream (in whatever encoding you need) into an array, and with a StringBuilder we can process the stream in chunks, checking for separator sequences along the way.

Here's a method that will handle an arbitrary delimiter:

public static IEnumerable<string> ReadDelimitedRows(StreamReader reader, string delimiter)
{
    char[] delimChars = delimiter.ToCharArray();
    int matchCount = 0;
    char[] buffer = new char[512];
    int rc = 0;
    StringBuilder sb = new StringBuilder();

    while ((rc = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < rc; i++)
        {
            char c = buffer[i];
            if (c == delimChars[matchCount])
            {
                if (++matchCount >= delimChars.Length)
                {
                    // found full row delimiter
                    yield return sb.ToString();
                    sb.Clear();
                    matchCount = 0;
                }
            }
            else
            {
                if (matchCount > 0)
                {
                    // flush the partially matched portion of the delimiter
                    // back into the row data
                    sb.Append(delimChars, 0, matchCount);
                    matchCount = 0;
                }
                // the mismatched character may itself start a new match
                if (c == delimChars[0])
                    matchCount = 1;
                else
                    sb.Append(c);
            }
        }
    }
    // return the last row if found
    if (sb.Length > 0)
        yield return sb.ToString();
}

This handles cases where part of your row delimiter appears in the actual data, and since matchCount carries across reads, it also copes with a delimiter that straddles a buffer boundary.
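
As a quick sanity check, you can run the sample data from the question through it via a MemoryStream (the stream wrapper here is just for the demonstration):

var data = "1'~'2'~'3#####11'~'12'~'13";
using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(data)))
using (var reader = new StreamReader(ms))
{
    foreach (var row in ReadDelimitedRows(reader, "#####"))
        Console.WriteLine(row);
    // prints:
    // 1'~'2'~'3
    // 11'~'12'~'13
}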

In order to translate your file from the input format you describe to a simple tab-delimited format you could do something along these lines:

const string RowDelimiter = "#####";
const string ColumnDelimiter = "'~'";

using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(outputFilename)))
{
    // write the column header row first (the header string from the question)
    writer.WriteLine(header);

    foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
    {
        // WriteLine terminates each row with a newline
        writer.WriteLine(row.Replace(ColumnDelimiter, "\t"));
    }
}

That should process fairly quickly without eating up too much memory. Some adjustments might be required for non-ASCII output.
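
For example, if the input isn't UTF-8 you can pass the encodings explicitly; the encoding name below is just a placeholder for whatever your data actually uses:

using (var reader = new StreamReader(inputFilename, Encoding.GetEncoding("iso-8859-1")))
using (var writer = new StreamWriter(File.Create(outputFilename), Encoding.UTF8))
{
    foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
    {
        writer.WriteLine(row.Replace(ColumnDelimiter, "\t"));
    }
}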

Answered by Corey