Need to pick up line terminators with StreamReader.ReadLine()

I wrote a C# program to read an Excel .xls/.xlsx file and output to CSV and Unicode text. I wrote a separate program to remove blank records. This is accomplished by reading each line with StreamReader.ReadLine(), and then going character by character through the string and not writing the line to output if it contains all commas (for the CSV) or all tabs (for the Unicode text).

The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these newlines (since it goes cell by cell) and write them as \x0A, while normal line endings are written with StreamWriter.WriteLine().

The trouble shows up in the separate program that removes blank records. StreamReader.ReadLine() by definition returns only the text of the line, not the terminator. Since an embedded newline splits one record across two lines, I can't tell which lines are full records and which are pieces of a record when I write them to the final file.

I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.

Tony Trozzo asked Mar 20 '09 20:03


2 Answers

I would recommend that you change your architecture to work more like a parser in a compiler.

You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.

In your case the tokens would be:

  1. Column data
  2. Comma
  3. End of Line

You would treat a '\n' ('\x0a') by itself as an embedded newline, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.

This has the advantages of:

  1. Doing only one pass over the data
  2. Storing at most one line's worth of data
  3. Reusing as much memory as possible (for the string builder and the list)
  4. Being easy to change should your requirements change

Here's a sample of what the Lexer would look like:

Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.

using System.Collections.Generic;
using System.IO;
using System.Text;

enum TokenType
{
    ColumnData,
    Comma,
    LineTerminator
}

class Token
{
    public TokenType Type { get; private set; }
    public string Data { get; private set; }

    public Token(TokenType type)
    {
        Type = type;
    }

    public Token(TokenType type, string data)
    {
        Type = type;
        Data = data;
    }
}

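// Splits the input into tokens. A lone '\n' is kept as column data;
// '\r' or '\r\n' ends a line.
private IEnumerable<Token> GetTokens(TextReader s)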
{
   var builder = new StringBuilder();

   while (s.Peek() >= 0)
   {
       var c = (char)s.Read();
       switch (c)
       {
           case ',':
           {
               if (builder.Length > 0)
               {
                   yield return new Token(TokenType.ColumnData, ExtractText(builder));
               }
               yield return new Token(TokenType.Comma);
               break;
           }
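           case '\r':
           {
               // Swallow the '\n' of a "\r\n" pair; a lone '\n' never reaches
               // this case and is appended to the column data instead.
               var next = s.Peek();
               if (next == '\n')
               {
                   s.Read();
               }

               if (builder.Length > 0)
               {
                   yield return new Token(TokenType.ColumnData, ExtractText(builder));
               }
               yield return new Token(TokenType.LineTerminator);
               break;
           }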
           default:
               builder.Append(c);
               break;
       }

   }

   // Flush any column data left over when the input does not end with a line terminator.

   if (builder.Length > 0)
   {
       yield return new Token(TokenType.ColumnData, ExtractText(builder));
   }
}

private string ExtractText(StringBuilder b)
{
    var ret = b.ToString();
    b.Remove(0, b.Length);
    return ret;
}

Your "parser" code would then look like this:

public void ConvertXLS(TextReader s)
{
    var columnData = new List<string>();
    bool lastWasColumnData = false;
    bool seenAnyData = false;

    foreach (var token in GetTokens(s))
    {
        switch (token.Type)
        {
            case TokenType.ColumnData:
            {
                seenAnyData = true;
                if (lastWasColumnData)
                {
                    //TODO: do some error reporting
                }
                else
                {
                    lastWasColumnData = true;
                    columnData.Add(token.Data);
                }
                break;
            }
            case TokenType.Comma:
            {
                if (!lastWasColumnData)
                {
                    columnData.Add(null);
                }
                lastWasColumnData = false;
                break;
            }
            case TokenType.LineTerminator:
            {
                if (seenAnyData)
                {
                    OutputLine(columnData);
                }
                seenAnyData = false;
                lastWasColumnData = false;
                columnData.Clear();
                break;
            }
        }
    }

    if (seenAnyData)
    {
        OutputLine(columnData);
    }
}
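
If all that is needed is the blank-record filter from the question, the same idea can be sketched without tokens. This is only an illustration, not part of the answer above: it assumes real record terminators are always "\r\n" and embedded newlines are always a lone '\n', and the names RecordReader, ReadRecords and IsBlank are made up for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class RecordReader
{
    // Yields one record per "\r\n". A lone '\n' (assumed to be an embedded
    // newline) stays inside the record it belongs to.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var builder = new StringBuilder();
        int next;
        while ((next = reader.Read()) >= 0)
        {
            var c = (char)next;
            if (c == '\r' && reader.Peek() == '\n')
            {
                reader.Read(); // consume the '\n' of the "\r\n" pair
                yield return builder.ToString();
                builder.Length = 0; // reuse the builder for the next record
            }
            else
            {
                builder.Append(c);
            }
        }

        // Input that does not end with "\r\n" still yields its last record.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }

    // True when the record contains nothing but separator characters
    // (',' for the CSV file, '\t' for the Unicode text file).
    public static bool IsBlank(string record, char separator)
    {
        foreach (var c in record)
        {
            if (c != separator)
            {
                return false;
            }
        }
        return true;
    }
}
```

Each record that passes IsBlank(record, ',') == false could then be written back with its own "\r\n", leaving the embedded '\n' characters untouched.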
Scott Wisniewski answered Nov 15 '22 22:11


You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.
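
To make the limitation concrete: ReadLine() treats '\n', '\r' and "\r\n" all as line terminators and strips them, so a lone '\n' inside a cell is indistinguishable from a real record break. The small sketch below shows this; the class and method names (ReadLineDemo, ReadAllLines) are made up for the illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class ReadLineDemo
{
    // Reads the text through a StreamReader, one ReadLine() call per line,
    // and returns the lines exactly as ReadLine() hands them back.
    public static List<string> ReadAllLines(string text)
    {
        var lines = new List<string>();
        var bytes = Encoding.UTF8.GetBytes(text);
        using (var reader = new StreamReader(new MemoryStream(bytes)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

Calling ReadAllLines("a,b\nc\r\nd,e\r\n") yields three strings, "a,b", "c" and "d,e", even though the input holds only two records terminated by "\r\n".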

I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.

It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.

x = x.Replace("\r\n", "\u0000")  // Or some other unused character
     .Replace("\n", "\\x0A")     // Or whatever escaping you need
     .Replace("\u0000", "\r\n"); // Restore the real line breaks

I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
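
For reference, a regex version might look like the sketch below. It assumes the same things as the chained Replace calls: real line breaks are always "\r\n", embedded ones are a lone '\n', and the escape to write is the literal text \x0A. The names NewlineEscaper and EscapeEmbeddedNewlines are invented for the example.

```csharp
using System.Text.RegularExpressions;

static class NewlineEscaper
{
    // Matches a line feed that is NOT preceded by a carriage return,
    // i.e. an embedded newline rather than the tail of a "\r\n" pair.
    private static readonly Regex LoneLineFeed = new Regex(@"(?<!\r)\n");

    // Replaces every lone '\n' with the four literal characters \x0A and
    // leaves "\r\n" record terminators untouched.
    public static string EscapeEmbeddedNewlines(string text)
    {
        return LoneLineFeed.Replace(text, @"\x0A");
    }
}
```

The negative lookbehind removes the need for a placeholder character, so nothing breaks if the data happens to contain "\u0000".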

Jon Skeet answered Nov 15 '22 23:11