
Reading text files line by line, with exact offset/position reporting

My simple requirement: read a huge text file (more than a million lines; for this example assume it's a CSV of some sort) and keep a reference to the beginning of each line for faster lookup in the future (read a line, starting at X).

I tried the naive and easy way first, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately that doesn't work as I intended:

Given a file containing the following

Foo
Bar
Baz
Bla
Fasel

and this very simple code

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = sr.BaseStream.Position;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos = sr.BaseStream.Position;
  }
}

the output is:

000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel

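I can imagine that the stream is trying to be helpful/efficient and probably reads in (big) chunks whenever new data is necessary, so BaseStream.Position reflects what has been buffered rather than what ReadLine has returned. For my purposes this is bad.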
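
The buffering guess can be checked in isolation. The following is a minimal sketch on an in-memory stream, assuming the same 25-byte content with \r\n line endings; `BufferingDemo` and `PositionAfterFirstLine` are hypothetical names used only for illustration.

```csharp
using System.IO;
using System.Text;

public static class BufferingDemo
{
    // Returns the position of the underlying stream right after the
    // first ReadLine call on a StreamReader wrapped around it.
    public static long PositionAfterFirstLine(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            reader.ReadLine();
            // The reader has filled its internal buffer, so the stream
            // position is ahead of the text actually handed out.
            return stream.Position;
        }
    }
}
```

For the five-line sample, `PositionAfterFirstLine(Encoding.UTF8.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel"))` returns 25, the full length of the data, even though only "Foo" has been returned.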

The question, finally: is there any way to get the (byte, char) offset while reading a file line by line without using a basic Stream and messing with \r \n \r\n and string encoding etc. manually? Not a big deal, really, I just don't like to build things that might exist already.

Benjamin Podszun asked Apr 07 '10 16:04



2 Answers

This is a really tough issue. After a long and exhausting survey of different solutions on the internet (including the solutions from this thread, thank you!), I had to reinvent the wheel and write my own.

I had the following requirements:

  • Performance - reading must be very fast, so reading one char at a time or using reflection is not acceptable; buffering is required
  • Streaming - the file can be huge, so reading it entirely into memory is not acceptable
  • Tailing - file tailing should be available
  • Long lines - lines can be very long, so the buffer size can't be fixed
  • Stable - a single-byte error was immediately visible during usage. Unfortunately, several implementations I found had stability problems

    using System;
    using System.IO;
    using System.Text;

    public class OffsetStreamReader
    {
        private const int InitialBufferSize = 4096;
        private readonly char _bom;
        private readonly byte _end;
        private readonly Encoding _encoding;
        private readonly Stream _stream;
        private readonly bool _tail;

        private byte[] _buffer;
        private int _processedInBuffer;   // index of the first unprocessed byte
        private int _informationInBuffer; // number of valid bytes in the buffer

        public OffsetStreamReader(Stream stream, bool tail)
        {
            if (stream == null || !stream.CanRead)
                throw new ArgumentException("Stream must be readable.", "stream");

            _buffer = new byte[InitialBufferSize];
            _stream = stream;
            _tail = tail;
            _encoding = Encoding.UTF8;

            _bom = '\uFEFF';
            _end = _encoding.GetBytes(new [] {'\n'})[0];
        }

        // Byte offset just past the last line returned, i.e. where the next line starts
        public long Offset { get; private set; }

        public string ReadLine()
        {
            // Underlying stream closed
            if (!_stream.CanRead)
                return null;

            var searchFrom = _processedInBuffer;
            var lineEnd = Search(_buffer, _end, searchFrom, _informationInBuffer);

            // No end in current buffer: read more until one shows up or the stream runs dry
            while (!lineEnd.HasValue)
            {
                // ReadBuffer moves the unprocessed bytes to index 0, so the bytes
                // that were already searched end at this index afterwards
                searchFrom = _informationInBuffer - _processedInBuffer;
                if (ReadBuffer() == 0)
                    break;

                lineEnd = Search(_buffer, _end, searchFrom, _informationInBuffer);
            }

            var haveEnd = lineEnd.HasValue;
            if (!haveEnd)
            {
                // EOF, or (when tailing) an unfinished line that may still grow;
                // the pending bytes stay in the buffer for the next call
                if (_tail || _processedInBuffer == _informationInBuffer)
                    return null;

                // File ended but no finalizing newline character
                lineEnd = _informationInBuffer;
            }

            var arr = new byte[lineEnd.Value - _processedInBuffer];
            Array.Copy(_buffer, _processedInBuffer, arr, 0, arr.Length);

            Offset = Offset + arr.Length + (haveEnd ? 1 : 0);
            _processedInBuffer = lineEnd.Value + (haveEnd ? 1 : 0);

            return _encoding.GetString(arr).TrimStart(_bom).TrimEnd('\r', '\n');
        }

        // Reads more bytes from the stream; returns how many were read (0 at end of stream)
        private int ReadBuffer()
        {
            var notProcessedPartLength = _informationInBuffer - _processedInBuffer;

            // Extend buffer to be able to fit whole line to the buffer
            // Was     [NOT_PROCESSED]
            // Become  [NOT_PROCESSED        ]
            if (notProcessedPartLength == _buffer.Length)
            {
                var extendedBuffer = new byte[_buffer.Length + _buffer.Length/2];
                Array.Copy(_buffer, extendedBuffer, _buffer.Length);
                _buffer = extendedBuffer;
            }

            // Copy not processed information to the beginning
            // Was    [PROCESSED NOT_PROCESSED]
            // Become [NOT_PROCESSED          ]
            Array.Copy(_buffer, _processedInBuffer, _buffer, 0, notProcessedPartLength);

            // Read more information to the empty part of buffer
            // Was    [ NOT_PROCESSED                   ]
            // Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
            var bytesRead = _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);

            _informationInBuffer = notProcessedPartLength + bytesRead;
            _processedInBuffer = 0;
            return bytesRead;
        }

        // Searches only the valid bytes [bufferOffset, bufferEnd), never stale data behind them
        private static int? Search(byte[] buffer, byte byteToSearch, int bufferOffset, int bufferEnd)
        {
            for (int i = bufferOffset; i < bufferEnd; i++)
            {
                if (buffer[i] == byteToSearch)
                    return i;
            }
            return null;
        }
    }
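
As a complement, the index-then-seek workflow the question describes can be sketched without any custom reader. This is only an illustration under stated assumptions: the data is ASCII or UTF-8 (so a `\n` byte always ends a line), and `LineIndex`, `Build` and `ReadLineAt` are hypothetical names that are not part of the class above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndex
{
    // Scans the stream once and returns the byte offset at which each line starts.
    // Assumption: '\n' (optionally preceded by '\r') terminates a line.
    public static List<long> Build(Stream stream)
    {
        var offsets = new List<long>();
        var buffer = new byte[4096];
        long position = 0;       // offset of buffer[0] within the stream
        bool atLineStart = true; // true when the next byte begins a new line
        int read;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart)
                {
                    offsets.Add(position + i);
                    atLineStart = false;
                }
                if (buffer[i] == (byte) '\n')
                    atLineStart = true;
            }
            position += read;
        }
        return offsets;
    }

    // Seeks to a previously recorded offset and decodes one line from there.
    public static string ReadLineAt(Stream stream, long offset)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        // A fresh reader per lookup avoids stale buffered data from an earlier
        // position. It is deliberately not disposed, so the stream stays open.
        var reader = new StreamReader(stream, Encoding.UTF8);
        return reader.ReadLine();
    }
}
```

For the sample file from the question (with \r\n line endings), `Build` yields the offsets 0, 5, 10, 15, 20, and `ReadLineAt(stream, 15)` returns "Bla". Creating a new StreamReader for every lookup is simple but not free; whether that cost matters depends on how often lines are looked up.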
    
Anton answered Oct 22 '22 23:10


You could create a TextReader wrapper that tracks the current position in the base TextReader. Note that it counts characters, not bytes:

using System.IO;

public class TrackingTextReader : TextReader
{
    private readonly TextReader _baseReader;
    private int _position;

    public TrackingTextReader(TextReader baseReader)
    {
        _baseReader = baseReader;
    }

    public override int Read()
    {
        int c = _baseReader.Read();
        // Count only real characters; -1 signals the end of the input
        if (c != -1)
            _position++;
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    // Number of characters (not bytes) consumed so far
    public int Position
    {
        get { return _position; }
    }
}

You could then use it as follows :

string text = @"Foo
Bar
Baz
Bla
Fasel";

using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
    string line;
    // Capture the position before each read so it marks the start of the line
    int start = trackingReader.Position;
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", start, line);
        start = trackingReader.Position;
    }
}
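
Because this wrapper counts characters, its positions only coincide with byte offsets for single-byte text. A minimal sketch of the conversion, assuming UTF-8 without a byte order mark and a prefix that does not split a surrogate pair; `OffsetUnits` and `CharOffsetToByteOffset` are hypothetical names used only for illustration:

```csharp
using System.Text;

public static class OffsetUnits
{
    // Number of UTF-8 bytes occupied by the first charCount chars of text.
    public static int CharOffsetToByteOffset(string text, int charCount)
    {
        return Encoding.UTF8.GetByteCount(text.ToCharArray(), 0, charCount);
    }
}
```

For example, `CharOffsetToByteOffset("Bär\nBaz", 4)` returns 5, because "ä" takes two bytes in UTF-8. Seeking in the underlying file needs the byte value, so a character position would have to be converted this way (which requires the preceding text) or tracked in bytes from the start.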

Thomas Levesque answered Oct 22 '22 23:10