How can I know if a text file ends with carriage return or not?

I have to process a text file and check if it ends with a carriage return or not.

I have to read the whole content, make some changes, and rewrite it into the target file, keeping exactly the same formatting as the original. And here is the problem: I don't know whether the original file has a line break at the end.

I've already tried:

  • the StreamReader.ReadLine() method, but the returned string does not contain the terminating carriage return and/or line feed.
  • the ReadToEnd() method, which could be a solution, but I'm worried about performance with very big files. The solution has to be efficient.
  • getting the last 2 characters and checking whether they equal "\r\n", which might work, but I have to deal with many encodings, and getting those characters reliably seems practically impossible.

How can I efficiently read all the text of a file and determine whether it ended in a newline?

Cristian Stirbe asked Jan 14 '17 10:01


2 Answers

After reading the file through ReadLine(), you can seek back to two bytes before the end of the file and compare what you read there to CR-LF. This assumes an encoding in which CR and LF each take one byte, such as ASCII or UTF-8:

using System;
using System.IO;

string s;
using (StreamReader sr = new StreamReader(@"C:\Users\User1\Desktop\a.txt", encoding: System.Text.Encoding.UTF8))
{
    while (!sr.EndOfStream)
    {
        s = sr.ReadLine();
        //process the line we read...
    }

    //back up to 2 bytes from end of file (fewer if the file is shorter than that)
    long tail = Math.Min(2, sr.BaseStream.Length);
    sr.BaseStream.Seek(-tail, SeekOrigin.End);
    sr.DiscardBufferedData(); //drop stale buffered data so Read() sees the new position

    int s1 = -1, s2 = -1, c;
    while ((c = sr.Read()) != -1) { s1 = s2; s2 = c; } //s2 = last char, s1 = the char before it

    if (s2 == 10) //file ends with CR-LF or LF (CR=13, LF=10)
    {
        if (s1 == 13) { } //file ends with CR-LF (Windows EOL format)
        else { } //file ends with just LF (UNIX/OSX format)
    }
}
S.Serpooshan answered Sep 27 '22 18:09


So you're processing a text file, meaning you need to read all text, and want to preserve any newline characters, even at the end of the file.

You've correctly concluded that ReadLine() eats those. In fact, ReadLine() alone can't tell you whether the file ended with a newline: when it does, StreamReader.EndOfStream is already true after the last line of text has been read, and no extra empty line is returned. File.ReadAllLines() drops the line terminators as well. ReadToEnd() and File.ReadAllText() keep them, but given that you're potentially dealing with large files, you don't want to read the entire file into memory at once.

You also can't just compare the last two bytes of the file, because there are encodings that use more than one byte to encode a character, such as UTF-16. So you'll need to read the file in an encoding-aware way. A StreamReader does just that.
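To illustrate why the number of bytes matters, here is a minimal sketch of a tail check that asks the encoding how "\n" is encoded instead of assuming one byte. The class and method names (TailCheck, EndsWithLineFeed) are made up for this example, and it assumes you already know the file's encoding and that the file is well-formed in that encoding; it is not a replacement for reading the file through a StreamReader.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class TailCheck
{
    // Hypothetical helper: true if the file's last bytes equal "\n" as encoded
    // by the given encoding. Assumes the caller knows the file's encoding.
    public static bool EndsWithLineFeed(string path, Encoding encoding)
    {
        // 1 byte in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32 (no BOM included)
        byte[] lf = encoding.GetBytes("\n");

        using (var fs = File.OpenRead(path))
        {
            if (fs.Length < lf.Length) return false; // too short to hold a line feed

            fs.Seek(-lf.Length, SeekOrigin.End);

            // Read exactly lf.Length bytes; Stream.Read may return fewer per call
            var tail = new byte[lf.Length];
            int read = 0;
            while (read < tail.Length)
            {
                int n = fs.Read(tail, read, tail.Length - read);
                if (n == 0) return false; // unexpected end of file
                read += n;
            }

            return tail.SequenceEqual(lf);
        }
    }
}
```

For example, Encoding.UTF8.GetBytes("\r\n") yields 2 bytes while Encoding.Unicode.GetBytes("\r\n") yields 4, which is why a fixed two-byte comparison only works for some encodings.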

So a solution would be to create your own version of ReadLine(), which includes the newline character(s) at the end:

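using System.IO;
using System.Text;

public static class StreamReaderExtensions
{
    // Like ReadLine(), but keeps the terminating "\n" (and a preceding "\r", if any).
    // Returns an empty string once the end of the stream has been reached.
    public static string ReadLineWithNewLine(this StreamReader reader)
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            int c = reader.Read();

            builder.Append((char) c);
            if (c == 10) // LF (10) ends the line
            {
                break;
            }
        }

        return builder.ToString();
    }
}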

Then you can check whether the last returned line ends in \n:

string line = "";

using (var stream = new StreamReader(@"D:\Temp\NewlineAtEnd.txt"))
{
    while (!stream.EndOfStream)
    {
        line = stream.ReadLineWithNewLine();
        Console.Write(line);
    }
}

Console.WriteLine();

if (line.EndsWith("\n"))
{
    Console.WriteLine("Newline at end of file");
}
else
{
    Console.WriteLine("No newline at end of file");
}

Though the StreamReader is heavily optimized, I can't vouch for the performance of reading one character at a time. A quick test using two equal 100 MB text files showed a quite drastic slowdown compared to ReadLine() (~1800 vs ~400 ms).

This approach does preserve the original line endings though, meaning you can safely rewrite a file using strings returned by this extension method, without changing all \n to \r\n or vice versa.
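If the per-character overhead turns out to matter, one possible variant is to read in blocks with Read(char[], int, int) and remember the last character seen. The following is only a sketch: the names (ChunkedCopy, CopyAndDetectTrailingLineFeed) and the 8192-character block size are arbitrary choices for illustration, it has not been benchmarked against the numbers above, and any processing that needs whole lines would have to handle lines that span two blocks.

```csharp
using System;
using System.IO;

public static class ChunkedCopy
{
    // Hypothetical helper: copies text from reader to writer block by block and
    // reports whether the last character read was a line feed.
    public static bool CopyAndDetectTrailingLineFeed(TextReader reader, TextWriter writer)
    {
        var buffer = new char[8192]; // block size is an arbitrary choice
        int last = -1;               // last character seen; stays -1 for empty input
        int count;

        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Process buffer[0..count) here; line endings pass through unchanged.
            writer.Write(buffer, 0, count);
            last = buffer[count - 1];
        }

        return last == '\n';
    }
}
```

Because the characters are written back exactly as they were read, the original mix of \n and \r\n is preserved in the same way as with the extension method.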

CodeCaster answered Sep 27 '22 18:09