
How do I handle line breaks in a CSV file using C#?

Tags: c#, csv

I have an Excel spreadsheet being converted into a CSV file in C#, but I am having a problem dealing with line breaks. For instance:

"John","23","555-5555"
"Peter","24","555-5
555"
"Mary","21","555-5555"

When I read the CSV file, if a record does not start with a double quote ("), then a line break was inserted by mistake and I have to remove it. I have some CSV reader classes from the internet, but I am concerned that they will fail on the line breaks.

How should I handle these line breaks?


Thanks everybody very much for your help.

Here is what I've done so far. My records have a fixed format and all start with JTW:

JTW;...;....;...;

JTW;...;...;....

JTW;....;...;..

..;...;... (wrong record, line break inserted)

JTW;...;...

So I check for the ; at index [3] of each line. If it is there, I write the line as a new record; if not, I append the line to the previous record (removing the line break).

I'm having problems now because I'm saving the file as a txt.

By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.

So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?

Here is my code:

using System;
using System.IO;

namespace EditorCSV
{
    class Program
    {
        static void Main(string[] args)
        {
            ReadFromFile("c:\\source.csv");
        }

        // Rebuilds records that were split by stray line breaks.
        // A line starts a new record only if the character at index 3 is ';' (as in "JTW;...").
        // Any other line is appended to the previous record without a line break.
        static void ReadFromFile(string filename)
        {
            int records = 0;
            bool isFirstLine = true;

            using (StreamReader reader = File.OpenText(filename))
            using (StreamWriter writer = File.CreateText("c:\\target.csv"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Check the length instead of catching IndexOutOfRangeException.
                    bool startsRecord = line.Length > 3 && line[3] == ';';

                    if (startsRecord)
                    {
                        records++;
                        if (!isFirstLine) writer.Write("\r\n");
                    }

                    writer.Write(line);
                    isFirstLine = false;
                }
            }

            Console.WriteLine("Records Processed: " + records + ".");
            Console.WriteLine("File Created Successfully");
            Console.ReadKey();
        }
    }
}
user144658 asked Jul 24 '09 17:07



8 Answers

Heed the advice from the experts and don't roll your own CSV parser.

Your first thought is, "How do I handle new line breaks?"

Your next thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.

Judah Gabriel Himango answered Oct 06 '22 17:10


CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.

Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.

Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.

Michael La Voie answered Oct 06 '22 16:10


Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.

I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
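
A minimal sketch of this idea, assuming every field is quoted so that a complete record always ends with a double quote. The class and method names (QuoteEndMerger, MergeBrokenLines) are made up for illustration. It would still be fooled by a cell whose own line happens to end with a quote character.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class QuoteEndMerger
{
    // Joins physical lines until the line just read ends with a double quote.
    // Assumption: all fields are quoted, so only a finished record ends with '"'.
    public static List<string> MergeBrokenLines(TextReader reader)
    {
        var records = new List<string>();
        var current = new StringBuilder();
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            current.Append(line); // the stray line break is dropped here

            bool endsWithQuote = line.Length > 0 && line[line.Length - 1] == '"';
            if (endsWithQuote)
            {
                records.Add(current.ToString());
                current.Clear();
            }
        }

        // Keep whatever is left if the input stops in the middle of a record.
        if (current.Length > 0) records.Add(current.ToString());
        return records;
    }
}
```

Calling MergeBrokenLines(File.OpenText(path)) on the sample from the question should glue the two halves of the "Peter" row back into one record.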

Doug answered Oct 06 '22 15:10


There is a built-in method for reading CSV files in .NET (it requires adding a reference to the Microsoft.VisualBasic assembly):

public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
    var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
    parser.SetDelimiters(separators);
    while (!parser.EndOfData)
        yield return parser.ReadFields();
}
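
A usage sketch for the method above, assuming the Microsoft.VisualBasic assembly is available to the project. ReadSV is repeated so the block compiles on its own; the wrapper class TextFieldParserDemo and the ReadSample helper are hypothetical names. The sample data is the question's example with the broken "Peter" row.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserDemo
{
    // Same shape as ReadSV above, with the parser disposed when enumeration ends.
    public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
    {
        using (var parser = new TextFieldParser(reader))
        {
            parser.SetDelimiters(separators);
            parser.HasFieldsEnclosedInQuotes = true; // already the default, set for clarity
            while (!parser.EndOfData)
                yield return parser.ReadFields();
        }
    }

    // Parses three records, one of which has a line break inside a quoted field.
    public static List<string[]> ReadSample()
    {
        string sample =
            "\"John\",\"23\",\"555-5555\"\r\n" +
            "\"Peter\",\"24\",\"555-5\r\n555\"\r\n" +
            "\"Mary\",\"21\",\"555-5555\"\r\n";

        return new List<string[]>(ReadSV(new StringReader(sample), ","));
    }
}
```

The line break is kept inside the third field of the "Peter" row, so it still has to be stripped (for example with string.Replace) if the goal is to remove it.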

If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader

Bhavesh Modi answered Oct 06 '22 16:10


I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):

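// Requires: using System.Collections.Generic; using System.IO; using System.Text;
private static List<List<string>> Parse(TextReader reader, char separator = ',')
{
    var rows = new List<List<string>>();
    var row = new List<string>();
    var isStringBlock = false; // true while inside a quoted field
    var sb = new StringBuilder();

    while (reader.Peek() != -1)
    {
        char c = (char)reader.Read();

        if (c == '"')
        {
            if (isStringBlock && reader.Peek() == '"')
            {
                reader.Read();   // "" inside a quoted field is an escaped quote
                sb.Append('"');
            }
            else
            {
                isStringBlock = !isStringBlock;
            }
        }
        else if (c == separator && !isStringBlock) // end of word
        {
            row.Add(sb.ToString().Trim()); // add word
            sb.Length = 0;
        }
        else if (c == '\n' && !isStringBlock) // end of line
        {
            row.Add(sb.ToString().Trim()); // add last word in line
            sb.Length = 0;

            rows.Add(row); // or do something else with row here
            row = new List<string>();
        }
        else if (c != '\r')
        {
            // A line break inside a quoted field is replaced with a space.
            sb.Append(c == '\n' ? ' ' : c);
        }
    }

    // Flush the last row when the input does not end with a line break.
    if (sb.Length > 0 || row.Count > 0)
    {
        row.Add(sb.ToString().Trim());
        rows.Add(row);
    }

    return rows;
}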
Zoman answered Oct 06 '22 15:10


Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.

Josh Close answered Oct 06 '22 16:10


Maybe you could count the double quotes (") in each line as you ReadLine(). If the count is odd, that raises the flag: the record continues on the next line. You could either ignore those lines, or read the following line and merge the two, dropping the "\n" between them.
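
A minimal sketch of the quote-counting idea, assuming quotes inside a field are escaped by doubling them ("") so that a complete record always contains an even number of quote characters. The names QuoteCountMerger, ReadLogicalRecords and joinWith are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class QuoteCountMerger
{
    // Reads physical lines and merges them while a quoted field is still open.
    // joinWith replaces the removed line break ("" simply drops it).
    public static List<string> ReadLogicalRecords(TextReader reader, string joinWith = "")
    {
        var records = new List<string>();
        var current = new StringBuilder();
        bool quoteOpen = false; // true after an odd number of quotes so far
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            if (quoteOpen) current.Append(joinWith);
            current.Append(line);

            // Each quote flips the state; an escaped "" flips it twice and cancels out.
            foreach (char c in line)
                if (c == '"') quoteOpen = !quoteOpen;

            if (!quoteOpen)
            {
                records.Add(current.ToString());
                current.Clear();
            }
        }

        // Input ended inside a quoted field: keep what was read rather than lose it.
        if (quoteOpen) records.Add(current.ToString());
        return records;
    }
}
```

Unlike a check on the first or last character of a line, this keeps working when a cell contains more than one line break, because lines are merged until the quotes balance.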

Freddy answered Oct 06 '22 15:10


What I usually do is read the text character by character, as opposed to line by line, because of this very problem.

As you read each character, you should be able to figure out where each cell starts and stops, and also tell a line break that ends a row from one inside a cell. If I remember correctly, for Excel-generated files anyway, rows end with \r\n, and newlines inside cells are only \n.
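
A minimal sketch of the character-by-character approach, assuming that a "\r\n" pair always ends a row and that any lone "\r" or "\n" sits inside a cell. That convention is worth checking on a real file first, since an editor or a file transfer may normalize line endings. CrLfSplitter, SplitRecords and replacement are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CrLfSplitter
{
    // Splits the input into records on "\r\n" only.
    // A lone '\r' or '\n' is treated as an in-cell break and swapped for 'replacement'.
    public static List<string> SplitRecords(TextReader reader, string replacement = " ")
    {
        var records = new List<string>();
        var sb = new StringBuilder();
        int next;

        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (c == '\r' && reader.Peek() == '\n')
            {
                reader.Read(); // consume the '\n' of the row terminator
                records.Add(sb.ToString());
                sb.Clear();
            }
            else if (c == '\r' || c == '\n')
            {
                sb.Append(replacement); // line break inside a cell
            }
            else
            {
                sb.Append(c);
            }
        }

        // Last record when the input does not end with "\r\n".
        if (sb.Length > 0) records.Add(sb.ToString());
        return records;
    }
}
```

Passing an empty string as replacement removes the in-cell breaks entirely instead of turning them into spaces.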

John answered Oct 06 '22 16:10