
CSV Parsing Options with .NET [closed]

Tags:

c#

.net

parsing

I'm looking at my delimited-file (e.g. CSV, tab-separated, etc.) parsing options on the Microsoft stack in general, and .NET specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.

So my options appear to be:

  1. Regex.Split
  2. TextFieldParser
  3. OLEDB CSV Parser

I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):

101, Bob, "Keeps his house ""clean"".
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."

The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third column of each row must preserve the newlines! Said differently, the parser must recognize when a line is "continuing" onto the next physical row because of an "unclosed" text qualifier.

The second criterion is that the delimiter and text qualifier must be configurable per file. Here are two strings, taken from different files, that I must be able to parse:

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

A proper parsing of string "first" would be:

  • This
  • Is,A,Record
  • That "Cannot", they say,
  • _
  • _
  • be
  • rightly
  • parsed
  • at all

The '_' simply means that a blank was captured - I don't want a literal underbar to appear.

One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.

Now for a dive into the technical options.

REGEX

First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:

var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
Regex.Split(first, regex).Dump(); // Dump() is LINQPad's output helper

The results, applied to string "first," are quite wonderful:

  • "This"
  • "Is,A,Record"
  • "That ""Cannot"", they say,"
  • ""
  • _
  • "be"
  • rightly
  • "parsed"
  • at all

It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!
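A minimal sketch of how the pattern might be generalized and the qualifiers cleaned up afterwards. `DelimitedSplit`, `BuildSplitPattern`, `Unquote`, and `Split` are hypothetical names, not part of any library; only `Regex.Split` is a framework call. It assumes a single-character delimiter and qualifier, and that a literal qualifier inside a field is escaped by doubling it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimitedSplit
{
    // Encodes a character as \uXXXX so it is safe both inside and outside
    // a regex character class, whatever the character is.
    private static string Esc(char c)
    {
        return "\\u" + ((int)c).ToString("X4");
    }

    // Same lookahead idea as the pattern above: split on a delimiter only
    // when an even number of qualifiers follows it.
    public static string BuildSplitPattern(char delimiter, char qualifier)
    {
        string d = Esc(delimiter);
        string q = Esc(qualifier);
        return $"{d}(?=(?:[^{q}]*{q}[^{q}]*{q})*(?![^{q}]*{q}))";
    }

    // Post-process step: drop enclosing qualifiers and collapse doubled ones.
    public static string Unquote(string field, char qualifier)
    {
        string q = qualifier.ToString();
        if (field.Length >= 2 && field[0] == qualifier && field[field.Length - 1] == qualifier)
        {
            return field.Substring(1, field.Length - 2).Replace(q + q, q);
        }
        return field;
    }

    // Splits one complete logical row into cleaned fields.
    public static string[] Split(string record, char delimiter, char qualifier)
    {
        return Regex.Split(record, BuildSplitPattern(delimiter, qualifier))
                    .Select(f => Unquote(f, qualifier))
                    .ToArray();
    }
}
```

With this, `DelimitedSplit.Split(first, ',', '"')` and `DelimitedSplit.Split(second, '|', '~')` would cover both sample strings, provided each call receives one complete logical row.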

But the real problem pertains to the multi-line criterion. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.

So this becomes a "chicken and the egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort out the multiple lines (I didn't check if the above regex could handle that). If I've got a 10 GB file, this could be a bit precarious.
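One possible way around the chicken-and-egg problem, sketched under assumptions: keep appending physical lines until the accumulated text contains an even number of text qualifiers, at which point the logical row is complete and can be handed to whatever splitter is in use. `LogicalRows` and `ReadLogicalRows` are hypothetical names. The sketch assumes the qualifier only ever appears as an enclosing character or doubled as an escape, and because it uses `ReadLine`, line breaks inside a field come back normalized to "\n".

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LogicalRows
{
    // Yields one logical row at a time, so only the current row is held in memory.
    public static IEnumerable<string> ReadLogicalRows(TextReader reader, char qualifier)
    {
        var row = new StringBuilder();
        bool open = false; // true while a qualified field is still unclosed
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            if (open)
            {
                row.Append('\n'); // the line break belongs to the open field
            }
            row.Append(line);

            // Each qualifier flips the state; doubled qualifiers flip it twice.
            foreach (char c in line)
            {
                if (c == qualifier) open = !open;
            }

            if (!open)
            {
                yield return row.ToString();
                row.Clear();
            }
        }

        // Input ended inside an unclosed field: emit what was collected.
        if (row.Length > 0)
        {
            yield return row.ToString();
        }
    }
}
```

Applied to the five-line sample file, this should produce two strings, each of which still contains its embedded line breaks.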

On to the next option.

TextFieldParser

Three lines of code will make the problem with this option apparent:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;

The Delimiters configuration looks good. However, the "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, while the only text qualifier on offer is the quotation mark. Remember, I need configurability over the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over.

OLEDB

A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good.

HELP

So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary text qualifier. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).

What say ye?

Brent Arias asked Mar 09 '12 22:03


2 Answers

Did you try searching for an already-existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.

Dour High Arch answered Sep 21 '22 14:09


I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try with the knowledge that it probably isn't bulletproof.

If it does work for you, feel free to change the namespace and use without restriction.

namespace NFC.Portability
{
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.IO;
    using System.Linq;
    using System.Text;

    /// <summary>
    /// Loads and reads a file with comma-separated values into a tabular format.
    /// </summary>
    /// <remarks>
    /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas.
    /// The class uses pointers, so it must be compiled with unsafe code enabled.
    /// </remarks>
    public unsafe class CsvReader
    {
        private const char SEGMENT_DELIMITER = ',';
        private const char DOUBLE_QUOTE = '"';
        private const char CARRIAGE_RETURN = '\r';
        private const char NEW_LINE = '\n';

        private DataTable _table = new DataTable();

        /// <summary>
        /// Gets the data contained by the instance in a tabular format.
        /// </summary>
        public DataTable Table
        {
            get
            {
                // validation logic could be added here to ensure that the object isn't in an invalid state

                return _table;
            }
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param>
        public CsvReader( string path )
        {
            if( path == null )
            {
                throw new ArgumentNullException( "path" );
            }

            // opened read-only; the StreamReader in Read disposes the stream
            FileStream fs = new FileStream( path, FileMode.Open, FileAccess.Read );
            Read( fs );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="stream">The stream from which the instance will be populated.</param>
        public CsvReader( Stream stream )
        {
            if( stream == null )
            {
                throw new ArgumentNullException( "stream" );
            }

            Read( stream );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="bytes">The array of bytes from which the instance will be populated.</param>
        public CsvReader( byte[] bytes )
        {
            if( bytes == null )
            {
                throw new ArgumentNullException( "bytes" );
            }

            MemoryStream ms = new MemoryStream( bytes );

            Read( ms );
        }

        private void Read( Stream s )
        {
            string lines;

            using( StreamReader sr = new StreamReader( s ) )
            {
                lines = sr.ReadToEnd();
            }

            if( string.IsNullOrWhiteSpace( lines ) )
            {
                throw new InvalidOperationException( "Data source cannot be empty." );
            }

            bool inQuotes = false;
            int lineNumber = 0;
            StringBuilder buffer = new StringBuilder( 128 );
            List<string> values = new List<string>();

            Action endSegment = () =>
            {
                values.Add( buffer.ToString() );
                buffer.Clear();
            };

            Action endLine = () =>
            {
                if( lineNumber == 0 )
                {
                    CreateColumns( values );
                }
                else
                {
                    CreateRow( values );
                }

                values.Clear();
                lineNumber++;
            };

            fixed( char* pStart = lines )
            {
                char* pChar = pStart;
                char* pEnd = pStart + lines.Length;

                while( pChar < pEnd ) // leave null terminator out
                {
                    if( *pChar == DOUBLE_QUOTE )
                    {
                        if( inQuotes )
                        {
                            if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER )
                            {
                                endSegment();
                                pChar++;
                            }
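                            else if( pChar + 1 < pEnd && !ApproachingNewLine( pChar, pEnd ) )
                            {
                                // a quote followed by anything other than a delimiter, a newline, or the end of input is kept as a literal quote
                                buffer.Append( DOUBLE_QUOTE );
                            }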
                        }

                        inQuotes = !inQuotes;
                    }
                    else if( *pChar == SEGMENT_DELIMITER )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                        }
                        else
                        {
                            buffer.Append( SEGMENT_DELIMITER );
                        }
                    }
                    else if( AtNewLine( pChar, pEnd ) )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                            endLine();
                            // for "\r\n" the '\n' is not skipped here; it ends an empty line, which CreateRow ignores
                        }
                        else
                        {
                            buffer.Append( *pChar );
                        }
                    }
                    else
                    {
                        buffer.Append( *pChar );
                    }

                    pChar++;
                }
            }

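            // append trailing values when the file does not end with a newline;
            // the buffer check covers a final row that holds a single value
            if( values.Count > 0 || buffer.Length > 0 )
            {
                endSegment();
                endLine();
            }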
        }

        /// <summary>
        /// Returns the next character in the sequence but does not advance the pointer. Checks bounds.
        /// </summary>
        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>
        /// Returns the next character in the sequence, or char.MinValue if range is exceeded.
        /// </returns>
        private char Peek( char* pChar, char* pEnd )
        {
            if( pChar + 1 < pEnd )
            {
                return *( pChar + 1 );
            }

            return char.MinValue;
        }

        /// <summary>
        /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters.
        /// </summary>
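        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>True if the current character is a line feed, or a carriage return followed by a line feed.</returns>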
        private bool AtNewLine( char* pChar, char* pEnd )
        {
            if( *pChar == NEW_LINE )
            {
                return true;
            }

            if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE )
            {
                return true;
            }

            return false;
        }

        /// <summary>
        /// Determines if the next character represents a newline, or the start of a newline.
        /// </summary>
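        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>True if the next character is a carriage return or a line feed.</returns>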
        private bool ApproachingNewLine( char* pChar, char* pEnd )
        {
            if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE )
            {
                // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence
                return true;
            }

            return false;
        }

        private void CreateColumns( List<string> columns )
        {
            foreach( string column in columns )
            {
                DataColumn dc = new DataColumn( column );
                _table.Columns.Add( dc );
            }
        }

        private void CreateRow( List<string> values )
        {
            if( values.All( o => string.IsNullOrWhiteSpace( o ) ) )
            {
                return; // ignore rows which have no content
            }

            DataRow dr = _table.NewRow();
            _table.Rows.Add( dr );

            for( int i = 0; i < values.Count; i++ )
            {
                dr[i] = values[i];
            }
        }
    }
}
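
For comparison, here is a sketch of the same character-by-character idea without unsafe code, with the delimiter and text qualifier passed in rather than hard-coded, and reading from a TextReader so the whole file does not have to sit in memory. It is an illustrative variant, not the class above: `DelimitedReader` and `ReadRows` are hypothetical names, there is no header handling or DataTable, and it assumes a literal qualifier inside a field is escaped by doubling it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedReader
{
    // Yields one logical row (array of fields) at a time.
    public static IEnumerable<string[]> ReadRows(TextReader reader, char delimiter, char qualifier)
    {
        var field = new StringBuilder();
        var fields = new List<string>();
        bool inQualifier = false; // inside a qualified field
        bool rowHasData = false;  // lets blank physical lines be skipped
        int next;

        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (inQualifier)
            {
                if (c != qualifier)
                {
                    field.Append(c); // delimiters and newlines are literal here
                }
                else if (reader.Peek() == qualifier)
                {
                    reader.Read();   // doubled qualifier: keep one copy
                    field.Append(qualifier);
                }
                else
                {
                    inQualifier = false; // closing qualifier
                }
            }
            else if (c == qualifier)
            {
                inQualifier = true;
                rowHasData = true;
            }
            else if (c == delimiter)
            {
                fields.Add(field.ToString());
                field.Clear();
                rowHasData = true;
            }
            else if (c == '\r' || c == '\n')
            {
                if (c == '\r' && reader.Peek() == '\n') reader.Read(); // treat "\r\n" as one break
                if (rowHasData)
                {
                    fields.Add(field.ToString());
                    field.Clear();
                    yield return fields.ToArray();
                    fields.Clear();
                    rowHasData = false;
                }
            }
            else
            {
                field.Append(c);
                rowHasData = true;
            }
        }

        // Final row when the input does not end with a newline.
        if (rowHasData)
        {
            fields.Add(field.ToString());
            yield return fields.ToArray();
        }
    }
}
```

Something like `foreach (var row in DelimitedReader.ReadRows(new StreamReader(path), '|', '~'))` would then walk a pipe-delimited, tilde-qualified file one logical row at a time; checking each row against the file's fixed column count is left to the caller.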
Tim M. answered Sep 22 '22 14:09