I have a huge text file, around 2 GB, that I am trying to parse in C#. The file uses custom delimiters for rows and columns. I want to extract the data and write it to another file, inserting a column header and replacing the row delimiter with a newline and the column delimiter with a tab, so that the data ends up in tabular format.
Sample data:
1'~'2'~'3#####11'~'12'~'13
RowDelimiter: #####
ColumnDelimiter: '~'
I keep getting a System.OutOfMemoryException on the following line:
while ((line = rdr.ReadLine()) != null)
public void ParseFile(string inputfile,string outputfile,string header)
{
using (StreamReader rdr = new StreamReader(inputfile))
{
string line;
while ((line = rdr.ReadLine()) != null)
{
using (StreamWriter sw = new StreamWriter(outputfile))
{
//Write the Header row
sw.Write(header);
//parse the file
string[] rows = line.Split(new string[] { ParserConstants.RowSeparator },
StringSplitOptions.None);
foreach (string row in rows)
{
string[] columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
foreach (string column in columns)
{
sw.Write(column + "\\t");
}
sw.Write(ParserConstants.NewlineCharacter);
Console.WriteLine();
}
}
Console.WriteLine("File Parsing completed");
}
}
}
Read the data into a fixed-size buffer and parse it chunk by chunk, carrying any incomplete row over to the next chunk.
using (StreamReader rdr = new StreamReader(inputfile))
using (StreamWriter sw = new StreamWriter(outputfile))
{
char[] buffer = new char[256];
int read;
//Write the Header row
sw.Write(header);
string remainder = string.Empty;
while ((read = rdr.Read(buffer, 0, 256)) > 0)
{
// prepend the carried-over remainder so a row delimiter that
// straddles two buffer reads is still found by Split
string bufferData = remainder + new string(buffer, 0, read);
//parse the file
string[] rows = bufferData.Split(
new string[] { ParserConstants.RowSeparator },
StringSplitOptions.None);
int completeRows = rows.Length - 1;
remainder = rows.Last();
foreach (string row in rows.Take(completeRows))
{
string[] columns = row.Split(
new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
foreach (string column in columns)
{
sw.Write(column + "\t");
}
sw.Write(ParserConstants.NewlineCharacter);
Console.WriteLine();
}
}
if (remainder.Length > 0)
{
string[] columns = remainder.Split(
new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
foreach (string column in columns)
{
sw.Write(column + "\t");
}
sw.Write(ParserConstants.NewlineCharacter);
Console.WriteLine();
}
Console.WriteLine("File Parsing completed");
}
As already mentioned in the comments, you won't be able to use ReadLine to handle this: since the file separates rows with ##### rather than newlines, ReadLine will try to return the whole file as a single string. You'll essentially have to process the data one byte, or character, at a time. The good news is that this is basically how ReadLine works anyway, so we're not losing much in this case.
Using a StreamReader we can read a series of characters from the source stream (in whatever encoding you need) into an array. Using that and a StringBuilder we can process the stream in chunks and check for separator sequences along the way.
Here's a method that will handle an arbitrary delimiter:
public static IEnumerable<string> ReadDelimitedRows(StreamReader reader, string delimiter)
{
char[] delimChars = delimiter.ToCharArray();
int matchCount = 0;
char[] buffer = new char[512];
int rc = 0;
StringBuilder sb = new StringBuilder();
while ((rc = reader.Read(buffer, 0, buffer.Length)) > 0)
{
for (int i = 0; i < rc; i++)
{
char c = buffer[i];
if (c == delimChars[matchCount])
{
if (++matchCount >= delimChars.Length)
{
// found full row delimiter
yield return sb.ToString();
sb.Clear();
matchCount = 0;
}
}
else
{
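if (matchCount > 0)
{
// append previously matched portion of the delimiter
sb.Append(delimChars, 0, matchCount);
matchCount = 0;
}
// the current character may itself start a new delimiter match
if (c == delimChars[0])
{
matchCount = 1;
}
else
{
sb.Append(c);
}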
}
}
}
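// flush a partial delimiter match left over at end of input
if (matchCount > 0)
sb.Append(delimChars, 0, matchCount);
// return the last row if found
if (sb.Length > 0)
yield return sb.ToString();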
}
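This should handle cases where part of your row delimiter appears in the actual data. It is not a full string-search algorithm, though: a delimiter whose prefix repeats inside itself (for example "aab" in the data "aaab") can still be missed by this simple matcher. That is not a concern for a delimiter like #####.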
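For delimiters that can overlap with themselves, one alternative is to append every character and then check whether the accumulated text ends with the delimiter. The following is a minimal sketch of that idea, not the method above: DelimitedReader, ReadRows, and EndsWith are hypothetical names, the default buffer size is arbitrary, and it accepts a TextReader so it can be tried on an in-memory string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedReader
{
    // Hypothetical helper: lazily yields rows from a TextReader,
    // splitting on an arbitrary multi-character delimiter.
    public static IEnumerable<string> ReadRows(TextReader reader, string delimiter, int bufferSize = 4096)
    {
        if (string.IsNullOrEmpty(delimiter))
            throw new ArgumentException("Delimiter must not be empty.", nameof(delimiter));

        var buffer = new char[bufferSize];
        var sb = new StringBuilder();
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                sb.Append(buffer[i]);
                if (EndsWith(sb, delimiter))
                {
                    // Drop the delimiter itself, then emit the completed row.
                    sb.Length -= delimiter.Length;
                    yield return sb.ToString();
                    sb.Clear();
                }
            }
        }
        // Whatever is left (including a partial delimiter) is the last row.
        if (sb.Length > 0)
            yield return sb.ToString();
    }

    // True when the tail of the builder equals the given suffix.
    private static bool EndsWith(StringBuilder sb, string suffix)
    {
        if (sb.Length < suffix.Length) return false;
        int offset = sb.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (sb[offset + i] != suffix[i]) return false;
        }
        return true;
    }
}
```

Because the check runs after every appended character, a delimiter that straddles two buffer reads is still found. The cost is one short tail comparison per character, which should be acceptable for short delimiters but has not been measured on a 2 GB file.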
In order to translate your file from the input format you describe to a simple tab-delimited format you could do something along these lines:
const string RowDelimiter = "#####";
const string ColumnDelimiter = "'~'";
using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(outputFilename)))
{
foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
{
// WriteLine, not Write, so each row ends with a newline
writer.WriteLine(row.Replace(ColumnDelimiter, "\t"));
}
}
That should process fairly quickly without eating up too much memory. Some adjustments might be required for non-ASCII output.
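The question also asks for a header row, and the encoding remark can be made concrete by passing an explicit encoding to the writer. Here is a hedged sketch of both: TabWriter, WriteTabDelimited, and OpenOutput are hypothetical names, and UTF-8 without a byte order mark is only an assumed choice of output encoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TabWriter
{
    // Hypothetical helper: writes the header line, then one
    // tab-separated line per input row.
    public static void WriteTabDelimited(IEnumerable<string> rows, TextWriter writer,
                                         string header, string columnDelimiter)
    {
        writer.WriteLine(header);
        foreach (string row in rows)
        {
            writer.WriteLine(row.Replace(columnDelimiter, "\t"));
        }
    }

    // Opens (and overwrites) the output file with an explicit encoding.
    // UTF-8 without a BOM is an assumption; pick whatever the consumer expects.
    public static StreamWriter OpenOutput(string path)
    {
        return new StreamWriter(path, false, new UTF8Encoding(false));
    }
}
```

With the names used above, this could be called as TabWriter.WriteTabDelimited(ReadDelimitedRows(reader, RowDelimiter), writer, header, ColumnDelimiter), where writer comes from TabWriter.OpenOutput(outputFilename). Since the rows are consumed lazily, only one row is held in memory at a time.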