I'm looking at my delimited-file (e.g. CSV, tab seperated, etc.) parsing options based on MS stack in general, and .net specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs. So my options appear to be: <ol> <li>Regex.Split</li> <li> TextFieldParser </li> <li> OLEDB CSV Parser </li> </ol> I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether): <code>101, Bob, "Keeps his house ""clean"". Needs to work on laundry." 102, Amy, "Brilliant. Driven. Diligent."</code> The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third row/column string must preserve the newlines! Said differently, the parser must recognize when lines are "continuing" onto the next physical row, due to the "unclosed" text qualifier. The second criteria is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse: <pre class="prettyprint"><code>var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all"; var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all"; </code></pre> A proper parsing of string "first" would be: <ul> <li>This</li> <li>Is,A,Record</li> <li>That "Cannot", they say,</li> <li>_ </li> <li>_ </li> <li>be</li> <li>rightly</li> <li>parsed</li> <li>at all</li> </ul> The '_' simply means that a blank was captured - I don't want a literal underbar to appear. One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file. Now for a dive into the technical options. REGEX First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex: <pre class="prettyprint"><code>var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"; var Regex.Split(first, regex).Dump(); </code></pre> The results, applied to string "first," are quite wonderful: <ul> <li>"This"</li> <li>"Is,A,Record"</li> <li>"That ""Cannot"", they say,"</li> <li>""</li> <li>_</li> <li>"be"</li> <li>rightly</li> <li>"parsed"</li> <li>at all</li> </ul> It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent! But the real problem pertains to the multi-line criteria. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine. So this becomes a "chicken and the egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort-out the multiple lines (I didn't check if the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious. On to the next option. TextFieldParser Three lines of code will make the problem with this option apparent: <pre class="prettyprint"><code>var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream); reader.Delimiters = new string[] { @"|" }; reader.HasFieldsEnclosedInQuotes = true; </code></pre> The Delimiters configuration looks good. However, the "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, but in contrast I have no other qualifier option other than quotations. Remember, I need configurability over the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over. OLEDB A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good. HELP So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary delimiter. And maybe OLEDB has resolved the stated issues (or perhaps never had them?). What say ye?

I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try with the knowledge that it probably isn't bulletproof. If it does work for you, feel free to change the namespace and use without restriction. <pre class="prettyprint"><code>namespace NFC.Portability { using System; using System.Collections.Generic; using System.Data; using System.IO; using System.Linq; using System.Text; /// <summary> /// Loads and reads a file with comma-separated values into a tabular format. /// </summary> /// <remarks> /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas. /// </remarks> public unsafe class CsvReader { private const char SEGMENT_DELIMITER = ','; private const char DOUBLE_QUOTE = '"'; private const char CARRIAGE_RETURN = '\r'; private const char NEW_LINE = '\n'; private DataTable _table = new DataTable(); /// <summary> /// Gets the data contained by the instance in a tabular format. /// </summary> public DataTable Table { get { // validation logic could be added here to ensure that the object isn't in an invalid state return _table; } } /// <summary> /// Creates a new instance of <c>CsvReader</c>. /// </summary> /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param> public CsvReader( string path ) { if( path == null ) { throw new ArgumentNullException( "path" ); } FileStream fs = new FileStream( path, FileMode.Open ); Read( fs ); } /// <summary> /// Creates a new instance of <c>CsvReader</c>. /// </summary> /// <param name="stream">The stream from which the instance will be populated.</param> public CsvReader( Stream stream ) { if( stream == null ) { throw new ArgumentNullException( "stream" ); } Read( stream ); } /// <summary> /// Creates a new instance of <c>CsvReader</c>. /// </summary> /// <param name="bytes">The array of bytes from which the instance will be populated.</param> public CsvReader( byte[] bytes ) { if( bytes == null ) { throw new ArgumentNullException( "bytes" ); } MemoryStream ms = new MemoryStream(); ms.Write( bytes, 0, bytes.Length ); ms.Position = 0; Read( ms ); } private void Read( Stream s ) { string lines; using( StreamReader sr = new StreamReader( s ) ) { lines = sr.ReadToEnd(); } if( string.IsNullOrWhiteSpace( lines ) ) { throw new InvalidOperationException( "Data source cannot be empty." ); } bool inQuotes = false; int lineNumber = 0; StringBuilder buffer = new StringBuilder( 128 ); List<string> values = new List<string>(); Action endSegment = () => { values.Add( buffer.ToString() ); buffer.Clear(); }; Action endLine = () => { if( lineNumber == 0 ) { CreateColumns( values ); values.Clear(); } else { CreateRow( values ); values.Clear(); } values.Clear(); lineNumber++; }; fixed( char* pStart = lines ) { char* pChar = pStart; char* pEnd = pStart + lines.Length; while( pChar < pEnd ) // leave null terminator out { if( *pChar == DOUBLE_QUOTE ) { if( inQuotes ) { if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER ) { endSegment(); pChar++; } else if( !ApproachingNewLine( pChar, pEnd ) ) { buffer.Append( DOUBLE_QUOTE ); } } inQuotes = !inQuotes; } else if( *pChar == SEGMENT_DELIMITER ) { if( !inQuotes ) { endSegment(); } else { buffer.Append( SEGMENT_DELIMITER ); } } else if( AtNewLine( pChar, pEnd ) ) { if( !inQuotes ) { endSegment(); endLine(); //pChar++; } else { buffer.Append( *pChar ); } } else { buffer.Append( *pChar ); } pChar++; } } // append trailing values at the end of the file if( values.Count > 0 ) { endSegment(); endLine(); } } /// <summary> /// Returns the next character in the sequence but does not advance the pointer. Checks bounds. /// </summary> /// <param name="pChar">Pointer to current character.</param> /// <param name="pEnd">End of range to check.</param> /// <returns> /// Returns the next character in the sequence, or char.MinValue if range is exceeded. /// </returns> private char Peek( char* pChar, char* pEnd ) { if( pChar < pEnd ) { return *( pChar + 1 ); } return char.MinValue; } /// <summary> /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters. /// </summary> /// <param name="pChar"></param> /// <param name="pEnd"></param> /// <returns></returns> private bool AtNewLine( char* pChar, char* pEnd ) { if( *pChar == NEW_LINE ) { return true; } if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE ) { return true; } return false; } /// <summary> /// Determines if the next character represents a newline, or the start of a newline. /// </summary> /// <param name="pChar"></param> /// <param name="pEnd"></param> /// <returns></returns> private bool ApproachingNewLine( char* pChar, char* pEnd ) { if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE ) { // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence return true; } return false; } private void CreateColumns( List<string> columns ) { foreach( string column in columns ) { DataColumn dc = new DataColumn( column ); _table.Columns.Add( dc ); } } private void CreateRow( List<string> values ) { if( values.Where( (o) => !string.IsNullOrWhiteSpace( o ) ).Count() == 0 ) { return; // ignore rows which have no content } DataRow dr = _table.NewRow(); _table.Rows.Add( dr ); for( int i = 0; i < values.Count; i++ ) { dr[i] = values[i]; } } } } </code></pre>

CSV Parsing Options with .NET [closed]

Tags:

c#

.net

parsing

I'm looking at my delimited-file (e.g. CSV, tab seperated, etc.) parsing options based on MS stack in general, and .net specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.

So my options appear to be:

Regex.Split
TextFieldParser
OLEDB CSV Parser

I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):

101, Bob, "Keeps his house ""clean"". Needs to work on laundry." 102, Amy, "Brilliant. Driven. Diligent."

The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third row/column string must preserve the newlines! Said differently, the parser must recognize when lines are "continuing" onto the next physical row, due to the "unclosed" text qualifier.

The second criteria is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse:

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

A proper parsing of string "first" would be:

This
Is,A,Record
That "Cannot", they say,
_
_
be
rightly
parsed
at all

The '_' simply means that a blank was captured - I don't want a literal underbar to appear.

One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.

Now for a dive into the technical options.

REGEX

First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:

var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
var Regex.Split(first, regex).Dump();

The results, applied to string "first," are quite wonderful:

"This"
"Is,A,Record"
"That ""Cannot"", they say,"
""
_
"be"
rightly
"parsed"
at all

It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!

But the real problem pertains to the multi-line criteria. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.

So this becomes a "chicken and the egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort-out the multiple lines (I didn't check if the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious.

On to the next option.

TextFieldParser

Three lines of code will make the problem with this option apparent:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;

The Delimiters configuration looks good. However, the "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, but in contrast I have no other qualifier option other than quotations. Remember, I need configurability over the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over.

OLEDB

A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good.

HELP

So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary delimiter. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).

What say ye?

861

asked Mar 09 '12 22:03

Brent Arias

2 Answers

Did you try searching for an already-existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.

112

answered Sep 21 '22 14:09

Dour High Arch

I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try with the knowledge that it probably isn't bulletproof.

If it does work for you, feel free to change the namespace and use without restriction.

namespace NFC.Portability
{
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.IO;
    using System.Linq;
    using System.Text;

    /// <summary>
    /// Loads and reads a file with comma-separated values into a tabular format.
    /// </summary>
    /// <remarks>
    /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas.
    /// </remarks>
    public unsafe class CsvReader
    {
        private const char SEGMENT_DELIMITER = ',';
        private const char DOUBLE_QUOTE = '"';
        private const char CARRIAGE_RETURN = '\r';
        private const char NEW_LINE = '\n';

        private DataTable _table = new DataTable();

        /// <summary>
        /// Gets the data contained by the instance in a tabular format.
        /// </summary>
        public DataTable Table
        {
            get
            {
                // validation logic could be added here to ensure that the object isn't in an invalid state

                return _table;
            }
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param>
        public CsvReader( string path )
        {
            if( path == null )
            {
                throw new ArgumentNullException( "path" );
            }

            FileStream fs = new FileStream( path, FileMode.Open );
            Read( fs );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="stream">The stream from which the instance will be populated.</param>
        public CsvReader( Stream stream )
        {
            if( stream == null )
            {
                throw new ArgumentNullException( "stream" );
            }

            Read( stream );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="bytes">The array of bytes from which the instance will be populated.</param>
        public CsvReader( byte[] bytes )
        {
            if( bytes == null )
            {
                throw new ArgumentNullException( "bytes" );
            }

            MemoryStream ms = new MemoryStream();
            ms.Write( bytes, 0, bytes.Length );
            ms.Position = 0;

            Read( ms );
        }

        private void Read( Stream s )
        {
            string lines;

            using( StreamReader sr = new StreamReader( s ) )
            {
                lines = sr.ReadToEnd();
            }

            if( string.IsNullOrWhiteSpace( lines ) )
            {
                throw new InvalidOperationException( "Data source cannot be empty." );
            }

            bool inQuotes = false;
            int lineNumber = 0;
            StringBuilder buffer = new StringBuilder( 128 );
            List<string> values = new List<string>();

            Action endSegment = () =>
            {
                values.Add( buffer.ToString() );
                buffer.Clear();
            };

            Action endLine = () =>
            {
                if( lineNumber == 0 )
                {
                    CreateColumns( values );
                    values.Clear();
                }
                else
                {
                    CreateRow( values );
                    values.Clear();
                }

                values.Clear();
                lineNumber++;
            };

            fixed( char* pStart = lines )
            {
                char* pChar = pStart;
                char* pEnd = pStart + lines.Length;

                while( pChar < pEnd ) // leave null terminator out
                {
                    if( *pChar == DOUBLE_QUOTE )
                    {
                        if( inQuotes )
                        {
                            if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER )
                            {
                                endSegment();
                                pChar++;
                            }
                            else if( !ApproachingNewLine( pChar, pEnd ) )
                            {
                                buffer.Append( DOUBLE_QUOTE );
                            }
                        }

                        inQuotes = !inQuotes;
                    }
                    else if( *pChar == SEGMENT_DELIMITER )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                        }
                        else
                        {
                            buffer.Append( SEGMENT_DELIMITER );
                        }
                    }
                    else if( AtNewLine( pChar, pEnd ) )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                            endLine();
                            //pChar++;
                        }
                        else
                        {
                            buffer.Append( *pChar );
                        }
                    }
                    else
                    {
                        buffer.Append( *pChar );
                    }

                    pChar++;
                }
            }

            // append trailing values at the end of the file
            if( values.Count > 0 )
            {
                endSegment();
                endLine();
            }
        }

        /// <summary>
        /// Returns the next character in the sequence but does not advance the pointer. Checks bounds.
        /// </summary>
        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>
        /// Returns the next character in the sequence, or char.MinValue if range is exceeded.
        /// </returns>
        private char Peek( char* pChar, char* pEnd )
        {
            if( pChar < pEnd )
            {
                return *( pChar + 1 );
            }

            return char.MinValue;
        }

        /// <summary>
        /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool AtNewLine( char* pChar, char* pEnd )
        {
            if( *pChar == NEW_LINE )
            {
                return true;
            }

            if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE )
            {
                return true;
            }

            return false;
        }

        /// <summary>
        /// Determines if the next character represents a newline, or the start of a newline.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool ApproachingNewLine( char* pChar, char* pEnd )
        {
            if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE )
            {
                // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence
                return true;
            }

            return false;
        }

        private void CreateColumns( List<string> columns )
        {
            foreach( string column in columns )
            {
                DataColumn dc = new DataColumn( column );
                _table.Columns.Add( dc );
            }
        }

        private void CreateRow( List<string> values )
        {
            if( values.Where( (o) => !string.IsNullOrWhiteSpace( o ) ).Count() == 0 )
            {
                return; // ignore rows which have no content
            }

            DataRow dr = _table.NewRow();
            _table.Rows.Add( dr );

            for( int i = 0; i < values.Count; i++ )
            {
                dr[i] = values[i];
            }
        }
    }
}

answered Sep 22 '22 14:09

Tim M.

Related questions
                            
                                ASP.NET corrupt assembly "Could not load file or assembly App_Web_*"
                            
                                C#: Does ResumeLayout(true) do the same as ResumeLayout(false) + PerformLayout()?
                            
                                Is there a possibility to differ virtual printer from physical one?
                            
                                Creating a C# DLL and using it from unmanaged C++
                            
                                Get drive label in C#
                            
                                Understanding CLR 2.0 Memory Model
                            
                                The riddle of the working broken query
                            
                                How can I simulate a hanging cable in WPF?
                            
                                Moving from ASP.Net to Java for web development
                            
                                How to safely save data to an existing file with C#?
                            
                                Exception handling in RIA Service
                            
                                Lazy<T> ExecutionAndPublication - Examples That Could Cause Deadlock
                            
                                EntityFramework query manipulation, db provider wrapping, db expression trees
                            
                                Is it more or less efficient to perform a check before performing a Replace in C#?
                            
                                Invalid URI: The Uri string is too long
                            
                                How to set MSDeploy settings in .csproj file
                            
                                Serializing with ProtoBuf.NET without tagging members
                            
                                what is ICustomTypeDescriptor and when to use it?
                            
                                How to read HardDisk Temperature?
                            
                                Base class includes field but type not compatible with type of control

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With