Reading text files line by line, with exact offset/position reporting

My simple requirement: read a huge text file (more than a million lines; for this example, assume it's a CSV of some sort) and keep a reference to the beginning of each line for faster lookup in the future (read a line, starting at X).

I tried the naive and easy way first, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately, that doesn't work as I intended:

Given a file containing the following

Foo
Bar
Baz
Bla
Fasel

and this very simple code

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = sr.BaseStream.Position;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos = sr.BaseStream.Position;
  }
}

the output is:

000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel

I can imagine that the stream is trying to be helpful/efficient and probably reads in (big) chunks whenever new data is necessary. For me, this is bad: BaseStream.Position reports how far the underlying stream has been read, not where the current line starts.

The question, finally: is there any way to get the (byte, char) offset while reading a file line by line, without using a basic Stream and dealing with \r, \n, \r\n, string encoding etc. manually? Not a big deal, really, I just don't like to build things that might exist already.

asked Apr 07 '10 16:04

Benjamin Podszun

2 Answers

This is a really tough issue. After a long and exhausting search through different solutions on the internet (including the solutions from this thread, thank you!), I had to reinvent the wheel and write my own.

I had the following requirements:

Performance - reading must be very fast, so reading one char at a time or using reflection is not acceptable; buffering is required
Streaming - the file can be huge, so reading it into memory entirely is not acceptable
Tailing - file tailing should be available
Long lines - lines can be very long, so the buffer size can't be fixed
Stable - a single-byte error was immediately visible during usage. Unfortunately for me, several implementations I found had stability problems

using System;
using System.IO;
using System.Text;

public class OffsetStreamReader
{
    private const int InitialBufferSize = 4096;
    private readonly char _bom;
    private readonly byte _end;
    private readonly Encoding _encoding;
    private readonly Stream _stream;
    private readonly bool _tail;

    private byte[] _buffer;
    private int _processedInBuffer;   // index of the first unprocessed byte
    private int _informationInBuffer; // number of valid bytes in the buffer

    public OffsetStreamReader(Stream stream, bool tail)
    {
        if (stream == null || !stream.CanRead)
            throw new ArgumentException("Stream must be non-null and readable.", "stream");

        _buffer = new byte[InitialBufferSize];
        _stream = stream;
        _tail = tail;
        _encoding = Encoding.UTF8;

        _bom = '\uFEFF';
        // In UTF-8 the byte 0x0A never occurs inside a multi-byte sequence,
        // so splitting on it cannot cut a character in half
        _end = _encoding.GetBytes(new[] {'\n'})[0];
    }

    // Number of bytes consumed so far, i.e. the offset at which the next line
    // starts, relative to the stream position when reading began
    public long Offset { get; private set; }

    public string ReadLine()
    {
        // Underlying stream closed
        if (!_stream.CanRead)
            return null;

        // Look for a line end in the valid part of the buffer only;
        // read more data until one is found or the stream has nothing more
        var lineEnd = Search(_buffer, _end, _processedInBuffer, _informationInBuffer);
        while (!lineEnd.HasValue && ReadBuffer() > 0)
            lineEnd = Search(_buffer, _end, _processedInBuffer, _informationInBuffer);

        var haveEnd = lineEnd.HasValue;
        if (!haveEnd)
        {
            // Nothing left at all, or (when tailing) an unfinished line
            // that may still be completed by a later write
            if (_tail || _processedInBuffer == _informationInBuffer)
                return null;

            // File ended but no finalizing newline character
            lineEnd = _informationInBuffer;
        }

        var length = lineEnd.Value - _processedInBuffer;
        var line = _encoding.GetString(_buffer, _processedInBuffer, length);

        var consumed = length + (haveEnd ? 1 : 0);
        Offset += consumed;
        _processedInBuffer += consumed;

        return line.TrimStart(_bom).TrimEnd('\r', '\n');
    }

    // Refills the buffer and returns the number of bytes read (0 = no more data)
    private int ReadBuffer()
    {
        var notProcessedPartLength = _informationInBuffer - _processedInBuffer;

        // Extend buffer to be able to fit whole line to the buffer
        // Was     [NOT_PROCESSED]
        // Become  [NOT_PROCESSED        ]
        if (notProcessedPartLength == _buffer.Length)
        {
            var extendedBuffer = new byte[_buffer.Length + _buffer.Length / 2];
            Array.Copy(_buffer, extendedBuffer, _buffer.Length);
            _buffer = extendedBuffer;
        }

        // Copy not processed information to the beginning
        // Was    [PROCESSED NOT_PROCESSED]
        // Become [NOT_PROCESSED          ]
        Array.Copy(_buffer, _processedInBuffer, _buffer, 0, notProcessedPartLength);

        // Read more information to the empty part of buffer
        // Was    [ NOT_PROCESSED                   ]
        // Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
        var read = _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);

        _informationInBuffer = notProcessedPartLength + read;
        _processedInBuffer = 0;
        return read;
    }

    // Returns the index of the first matching byte in [from, to), or null
    private static int? Search(byte[] buffer, byte byteToSearch, int from, int to)
    {
        for (int i = from; i < to; i++)
        {
            if (buffer[i] == byteToSearch)
                return i;
        }
        return null;
    }
}
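
For the lookup part of the question (read a line, starting at X), the recorded offsets can later be passed to Stream.Seek. Below is a minimal, self-contained sketch of that idea. It does not use the class above; it scans the bytes for '\n' directly and assumes UTF-8 text in a seekable stream. The names LineIndexSketch, BuildIndex and ReadLineAt are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndexSketch
{
    // Returns the byte offset at which each line starts ('\n' ends a line).
    public static List<long> BuildIndex(Stream stream)
    {
        var starts = new List<long>();
        var buffer = new byte[4096];
        long position = 0;       // offset of buffer[0] within the stream
        bool atLineStart = true; // true if the next byte begins a new line
        int read;

        stream.Seek(0, SeekOrigin.Begin);
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart)
                {
                    starts.Add(position + i);
                    atLineStart = false;
                }
                if (buffer[i] == (byte)'\n')
                    atLineStart = true;
            }
            position += read;
        }
        return starts;
    }

    // Seeks to a stored offset and decodes a single line (UTF-8 assumed).
    public static string ReadLineAt(Stream stream, long offset)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        // leaveOpen: true, so disposing the reader keeps the stream usable
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 1024, true))
            return reader.ReadLine();
    }

    public static void Main()
    {
        var bytes = Encoding.UTF8.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel");
        using (var stream = new MemoryStream(bytes))
        {
            foreach (var offset in BuildIndex(stream))
                Console.WriteLine("{0:d3} {1}", offset, ReadLineAt(stream, offset));
        }
    }
}
```

For the sample data from the question (with \r\n line endings and no BOM), this should print the offsets 000, 005, 010, 015 and 020 next to the corresponding lines. For a real file, the index could be built once and kept in memory, since it costs only one long per line.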

answered Oct 22 '22 23:10

Anton

You could create a TextReader wrapper that tracks the current position in the base TextReader. Note that the position is counted in characters, not bytes:

using System.IO;

public class TrackingTextReader : TextReader
{
    private readonly TextReader _baseReader;
    private long _position;

    public TrackingTextReader(TextReader baseReader)
    {
        _baseReader = baseReader;
    }

    public override int Read()
    {
        int c = _baseReader.Read();
        // Count only real characters; Read() returns -1 at the end of the input
        if (c != -1)
            _position++;
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    // Number of characters (not bytes) consumed so far
    public long Position
    {
        get { return _position; }
    }
}

You could then use it as follows:

string text = @"Foo
Bar
Baz
Bla
Fasel";

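using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
    string line;
    // Remember the position before each read, so that it marks the start of the line
    long pos = trackingReader.Position;
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", pos, line);
        pos = trackingReader.Position;
    }
}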
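
One caveat worth illustrating: a position counted in characters matches the byte offset in the file only for single-byte text. The sketch below (the class name CharVsByteOffsets and the helper Utf8ByteCount are made up for this example, and UTF-8 is an assumption) compares both counts for a line containing non-ASCII characters.

```csharp
using System;
using System.Text;

public static class CharVsByteOffsets
{
    // Number of bytes the given text occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        // "Füße" followed by a newline; ü and ß take two bytes each in UTF-8
        string firstLine = "F\u00FC\u00DFe\n";

        Console.WriteLine("chars: {0}", firstLine.Length);
        Console.WriteLine("bytes: {0}", Utf8ByteCount(firstLine));
    }
}
```

Here the second line would start at character 5 but at byte 7, so a character position cannot be handed to Stream.Seek directly unless the text is known to be single-byte.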

answered Oct 22 '22 23:10

Thomas Levesque

Tags: c#, text-files, offset
