
FileInfo.Length != sum of all line lengths

Tags:

c#

I'm trying to make a progress bar for reading a big file. I set the progress bar's maximum value to FileInfo.Length, read each line using StreamReader.ReadLine, and add up the length of each line (String.Length) to set the progress bar's current value.

What I noticed is that the file's total length differs from the sum of the line lengths. For example:

FileInfo.Length = 25577646
Sum of all line lengths = 25510563

Why is there such a difference?

Thanks for your help!

SeyoS asked Dec 20 '22 05:12


2 Answers

You aren't counting the end-of-line characters. Each line ending takes from 1 to 4 bytes, depending on the encoding and on whether it is \n, \r, or \r\n (1 byte = UTF-8 + \n, 4 bytes = UTF-16 + \r\n).

Note that with ReadLine it isn't possible to check which end-of-line sequence (\n, \r, or \r\n) it encountered.

From ReadLine:

A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n")
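
To illustrate the quoted definition, here is a minimal sketch using StringReader, whose ReadLine follows the same line definition. The class and method names (ReadLineDemo, SplitWithReadLine, SumOfLineLengths) are made up for this example and are not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Returns the lines produced by ReadLine for the given text.
    // The terminators (\n, \r or \r\n) are not part of the returned strings.
    public static List<string> SplitWithReadLine(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    // Sum of String.Length over all lines, as computed in the question.
    public static int SumOfLineLengths(string text)
    {
        int sum = 0;
        foreach (string line in SplitWithReadLine(text))
        {
            sum += line.Length;
        }
        return sum;
    }
}
```

For the 14-character input "a\nbb\rccc\r\ndddd", SplitWithReadLine returns four lines and SumOfLineLengths returns 10: the four terminator characters are dropped, and nothing in the result says which terminator ended which line.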

Another problem: if your file is UTF-8, the C# char count differs from the byte count: è is one char in C# (which uses UTF-16) but 2 bytes in UTF-8. You could count bytes instead:

int len = Encoding.UTF8.GetByteCount(line);
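
A slightly fuller sketch of the same idea, assuming you know both the file's encoding and the line terminator it uses (ReadLine cannot tell you the latter). ByteCountDemo and BytesForLine are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    // Bytes the line occupies on disk in the given encoding,
    // plus the bytes of the terminator that ReadLine stripped.
    // Assumption: every line ends with the same, known terminator.
    public static long BytesForLine(string line, Encoding encoding, string newline)
    {
        return encoding.GetByteCount(line) + encoding.GetByteCount(newline);
    }
}
```

For example, "caf\u00e8" ("cafè") has Length 4 but takes 5 bytes in UTF-8, so BytesForLine("caf\u00e8", Encoding.UTF8, "\r\n") gives 7, while Encoding.Unicode (UTF-16) gives 12. The total can still be slightly off: a byte order mark at the start of the file is not counted, and the last line may have no terminator.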
xanatos answered Dec 21 '22 23:12


Two problems here:

  • string.Length gives you the number of characters in each string, whereas FileInfo.Length gives you the number of bytes. Those can be very different things, depending on the characters and the encoding used.
  • You're not including the line breaks (typically \n or \r\n), as those are removed when reading lines with TextReader.ReadLine.

In terms of what to do about this...

  • You presumably know the file's encoding, so you could convert each line back into bytes by calling Encoding.GetBytes to account for that difference. It would be pretty wasteful to do this though.
  • If you know the line break used by the file, you could just add the relevant number of bytes for each line you read.
  • You could keep a reference to the underlying stream and use Stream.Position to detect how far through the file you've actually read. That won't necessarily be the same as the amount of data you've processed though, as the StreamReader will have a buffer. (So you may well "see" that the Stream has read all the data even though you haven't processed all the lines yet.)

The last idea is probably the cleanest, IMO.
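
A minimal sketch of that last idea, assuming the file can be opened as a FileStream and that a progress callback is acceptable. ProgressReader, ReadWithProgress and the report parameter are hypothetical names, not an existing API.

```csharp
using System;
using System.IO;
using System.Text;

public static class ProgressReader
{
    // Reads the file line by line and calls report(bytesRead, totalBytes)
    // after each line. Returns the number of lines read.
    public static int ReadWithProgress(string path, Encoding encoding, Action<long, long> report)
    {
        int lineCount = 0;
        using (var stream = File.OpenRead(path))
        using (var reader = new StreamReader(stream, encoding))
        {
            long total = stream.Length; // total size in bytes
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lineCount++;
                // Position is how far the underlying stream has been read.
                // It runs ahead of the lines processed, because StreamReader
                // fills its buffer in chunks.
                report(stream.Position, total);
            }
        }
        return lineCount;
    }
}
```

Because of the reader's buffer, the reported position advances in steps rather than line by line, and a small file will show as fully read from the first line on. For a progress bar over a large file that granularity is usually acceptable.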

Jon Skeet answered Dec 21 '22 23:12