What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

Tags: c#, parsing, csv, bulk

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that rarely show up inside the field values themselves. After searching around, I've found no solutions in .NET that suit my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4GB file with some field values easily running to several million characters).

While this seems a bit extreme, it is actually standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.

Here's an example input:

Field delimiter = 
quote character = þ

þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...

Edit: So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this, and I have a feeling I probably didn't need to write one from scratch anyway.
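
For comparison, here's a minimal sketch of the kind of hand-rolled streaming parser described above. It reads one character at a time so a multi-megabyte field value only ever lives in a single StringBuilder, never in whole-line string copies. The delimiter ('\x14') is a placeholder, since the question's actual delimiter character didn't survive formatting, and no quote-escaping is handled (the format assumes the quote character never appears inside values):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class EddParser
{
    const char Quote = 'þ';
    const char Delim = '\x14'; // placeholder -- substitute the file's real delimiter

    // Streams one record at a time from the reader.
    public static IEnumerable<string[]> Parse(TextReader reader)
    {
        var fields = new List<string>();
        var value = new StringBuilder();
        bool inQuotes = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (inQuotes)
            {
                if (ch == Quote) inQuotes = false; // closing quote
                else value.Append(ch);             // field body (newlines included)
            }
            else if (ch == Quote) inQuotes = true; // opening quote
            else if (ch == Delim)
            {
                fields.Add(value.ToString());
                value.Clear();
            }
            else if (ch == '\n')
            {
                fields.Add(value.ToString());
                value.Clear();
                yield return fields.ToArray();
                fields.Clear();
            }
            // anything else outside quotes ('\r', stray bytes) is ignored
        }
        if (value.Length > 0 || fields.Count > 0)
        {
            fields.Add(value.ToString());          // last record, no trailing newline
            yield return fields.ToArray();
        }
    }
}

Because it tracks the quote state itself, a parser like this also survives newlines embedded inside quoted values, which any ReadLine-based approach cannot.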

asked Dec 17 '25 by llamaoo7

2 Answers

Use the FileHelpers API. It's .NET and open source. It's extremely fast, using compiled IL code to set fields on strongly typed objects, and it supports streaming.

It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
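
As a sketch of what that looks like (assuming the FileHelpers NuGet package; the '\x14' delimiter and the record/field names are placeholders, since the question's real delimiter didn't survive formatting):

using System;
using FileHelpers;

// Every field in the sample input is þ-quoted.
[DelimitedRecord("\x14")]
public class EddRecord
{
    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field1;

    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field2;

    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field3;

    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field4;
}

class Program
{
    static void Main()
    {
        // The async engine streams records instead of loading the file at once.
        var engine = new FileHelperAsyncEngine<EddRecord>();
        using (engine.BeginReadFile(@"c:\test.file"))
        {
            foreach (EddRecord rec in engine)
                Console.WriteLine(rec.Field1);
        }
    }
}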

If for some reason that doesn't do it for you, try just reading line by line with a string.Split:

// Lazily stream the file, splitting each line on the quote character.
// Note: because 'þ' is the quote rather than the delimiter, Split also
// yields empty entries between adjacent quotes, so field indices are offset.
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}

That'll give you simple string arrays representing the lines, in a streaming fashion, that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily evaluated, so don't close or alter the StreamReader until you've iterated (or forced a full load with ToList/ToArray or the like; given your file size, I assume you won't do that!).

Here's a good sample use of it:

using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    // Skip the header row, then filter and project without materializing.
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();

This will skip the header line, then print the first two fields from every line whose fourth split entry contains the string "something". It does all of this without loading the entire file into memory.

answered Dec 20 '25 by TheSoftwareJedi


Windows and high-performance I/O means: use I/O completion ports (IOCP). You may have to do some extra plumbing to get it working in your case.

This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:

18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed code.

I had to learn that one the hard way ;) but, ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, and is frequently used in socket servers.
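
As a rough sketch of what IOCP-backed reads look like from managed code: opening a FileStream with FileOptions.Asynchronous routes ReadAsync through the OS's completion ports on Windows rather than blocking a thread per read (the path and chunk size below are placeholders):

using System;
using System.IO;
using System.Threading.Tasks;

class IocpRead
{
    static async Task Main()
    {
        var buffer = new byte[1 << 20]; // 1 MB chunks; tune for your workload

        // FileOptions.Asynchronous is what enables completion-port I/O;
        // SequentialScan hints the cache manager about the access pattern.
        using (var fs = new FileStream(@"c:\test.file", FileMode.Open,
            FileAccess.Read, FileShare.Read, bufferSize: 1 << 16,
            options: FileOptions.Asynchronous | FileOptions.SequentialScan))
        {
            long total = 0;
            int read;
            while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                total += read; // hand each chunk to the parser here
            }
            Console.WriteLine("Read " + total + " bytes");
        }
    }
}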

As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.

answered Dec 20 '25 by RandomNickName42


