My simple requirement: read a huge (more than a million lines) text file (for this example, assume it's some sort of CSV) and keep a reference to the beginning of each line for faster lookup in the future (read a line, starting at X).
I tried the naive and easy way first, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately that doesn't work as I intended:
Given a file containing the following
Foo
Bar
Baz
Bla
Fasel
and this very simple code
using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
    string line;
    long pos = sr.BaseStream.Position;
    while ((line = sr.ReadLine()) != null) {
        Console.Write("{0:d3} ", pos);
        Console.WriteLine(line);
        pos = sr.BaseStream.Position;
    }
}
the output is:
000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel
I can imagine that the stream is trying to be helpful/efficient and probably reads in (big) chunks whenever new data is necessary. For me this is bad..
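A minimal sketch that seems to confirm this guess, assuming the sample file uses \r\n line endings and has no BOM (which makes it exactly 25 bytes long). A MemoryStream stands in for the file, and the BufferingDemo and PositionAfterFirstLine names are made up for this illustration:

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferingDemo
{
    // Returns BaseStream.Position as observed right after the first ReadLine call.
    public static long PositionAfterFirstLine(byte[] data)
    {
        using (var sr = new StreamReader(new MemoryStream(data)))
        {
            sr.ReadLine();
            return sr.BaseStream.Position;
        }
    }

    public static void Main()
    {
        // Same content as the sample file, with \r\n line endings (assumption).
        byte[] data = Encoding.UTF8.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel");

        Console.WriteLine(data.Length);                  // 25
        // For an input this small the reader pulls everything into its
        // internal buffer on the first read, so this prints 25 as well.
        Console.WriteLine(PositionAfterFirstLine(data));
    }
}
```

So the 025 in the output is simply the length of the whole file: BaseStream.Position reports how far the reader has buffered ahead, not how far ReadLine has consumed.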
The question, finally: Any way to get the (byte, char) offset while reading a file line by line without using a basic Stream and messing with \r \n \r\n and string encoding etc. manually? Not a big deal, really, I just don't like to build things that might exist already..
An offset into a file is simply the byte position within that file, counted from 0; thus "offset 240" is actually the 241st byte in the file.
In Python, the seek() method moves the file handle to a given position. The file handle acts like a cursor that defines where data is read from or written to in the file.
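In .NET the same idea is expressed through Stream.Seek and Stream.Position. As a rough sketch of how offsets and seeking fit together for the line-lookup problem, the following scans the raw bytes once, records where each line starts, and later seeks straight to a recorded offset. It assumes an ASCII-compatible encoding such as UTF-8 (where the byte 0x0A can only mean '\n') and lines ending in \n or \r\n; the LineOffsetIndex, Build and ReadLineAt names are hypothetical, and a MemoryStream stands in for a real FileStream:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineOffsetIndex
{
    // Scans the stream once and returns the byte offset at which each line starts.
    // Assumption: ASCII-compatible encoding (e.g. UTF-8); not valid for UTF-16/UTF-32.
    public static List<long> Build(Stream stream)
    {
        var offsets = new List<long>();
        var buffer = new byte[64 * 1024];
        long chunkStart = 0;      // absolute offset of buffer[0] in the stream
        bool atLineStart = true;  // true when the next byte begins a new line
        int read;

        stream.Seek(0, SeekOrigin.Begin);
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart)
                {
                    offsets.Add(chunkStart + i);
                    atLineStart = false;
                }
                if (buffer[i] == (byte)'\n')
                    atLineStart = true;
            }
            chunkStart += read;
        }
        return offsets;
    }

    // Seeks to a previously recorded offset and decodes one line from there.
    public static string ReadLineAt(Stream stream, long offset)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        // leaveOpen: true, so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 1024, true))
        {
            return reader.ReadLine();
        }
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel");
        using (var stream = new MemoryStream(bytes))
        {
            List<long> offsets = Build(stream);
            Console.WriteLine(string.Join(", ", offsets));      // 0, 5, 10, 15, 20
            Console.WriteLine(ReadLineAt(stream, offsets[2]));  // Baz
        }
    }
}
```

Creating a fresh StreamReader per lookup is wasteful if you do many lookups in a row, but it sidesteps the reader's internal buffer getting out of sync with the stream position after a Seek.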
This is a really tough issue. After a long and exhausting search through different solutions on the internet (including the solutions from this thread, thank you!) I had to reinvent the wheel and write my own.
I had the following requirements:
Stable: a single-byte error was immediately visible during usage. Unfortunately for me, several implementations I found had stability problems.
using System;
using System.IO;
using System.Text;

public class OffsetStreamReader
{
    private const int InitialBufferSize = 4096;
    private readonly char _bom;
    private readonly byte _end;
    private readonly Encoding _encoding;
    private readonly Stream _stream;
    private readonly bool _tail;
    private byte[] _buffer;
    private int _processedInBuffer;
    private int _informationInBuffer;

    public OffsetStreamReader(Stream stream, bool tail)
    {
        _buffer = new byte[InitialBufferSize];
        _processedInBuffer = InitialBufferSize;
        if (stream == null || !stream.CanRead)
            throw new ArgumentException("stream");
        _stream = stream;
        _tail = tail;
        _encoding = Encoding.UTF8;
        _bom = '\uFEFF';
        _end = _encoding.GetBytes(new [] {'\n'})[0];
    }

    // Byte offset just past the last line returned by ReadLine
    public long Offset { get; private set; }

    public string ReadLine()
    {
        // Underlying stream closed
        if (!_stream.CanRead)
            return null;
        // EOF
        if (_processedInBuffer == _informationInBuffer)
        {
            if (_tail)
            {
                _processedInBuffer = _buffer.Length;
                _informationInBuffer = 0;
                ReadBuffer();
            }
            return null;
        }
        var lineEnd = Search(_buffer, _end, _processedInBuffer);
        var haveEnd = true;
        // File ended but no finalizing newline character
        if (lineEnd.HasValue == false && _informationInBuffer + _processedInBuffer < _buffer.Length)
        {
            if (_tail)
                return null;
            else
            {
                lineEnd = _informationInBuffer;
                haveEnd = false;
            }
        }
        // No end in current buffer
        if (!lineEnd.HasValue)
        {
            ReadBuffer();
            if (_informationInBuffer != 0)
                return ReadLine();
            return null;
        }
        var arr = new byte[lineEnd.Value - _processedInBuffer];
        Array.Copy(_buffer, _processedInBuffer, arr, 0, arr.Length);
        Offset = Offset + lineEnd.Value - _processedInBuffer + (haveEnd ? 1 : 0);
        _processedInBuffer = lineEnd.Value + (haveEnd ? 1 : 0);
        return _encoding.GetString(arr).TrimStart(_bom).TrimEnd('\r', '\n');
    }

    private void ReadBuffer()
    {
        var notProcessedPartLength = _buffer.Length - _processedInBuffer;
        // Extend buffer to be able to fit whole line to the buffer
        // Was    [NOT_PROCESSED]
        // Become [NOT_PROCESSED        ]
        if (notProcessedPartLength == _buffer.Length)
        {
            var extendedBuffer = new byte[_buffer.Length + _buffer.Length/2];
            Array.Copy(_buffer, extendedBuffer, _buffer.Length);
            _buffer = extendedBuffer;
        }
        // Copy not processed information to the beginning
        // Was    [PROCESSED NOT_PROCESSED]
        // Become [NOT_PROCESSED          ]
        Array.Copy(_buffer, (long) _processedInBuffer, _buffer, 0, notProcessedPartLength);
        // Read more information to the empty part of buffer
        // Was    [ NOT_PROCESSED                   ]
        // Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
        _informationInBuffer = notProcessedPartLength + _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);
        _processedInBuffer = 0;
    }

    // Returns the index of the first matching byte at or after bufferOffset, or null
    private int? Search(byte[] buffer, byte byteToSearch, int bufferOffset)
    {
        for (int i = bufferOffset; i < buffer.Length - 1; i++)
        {
            if (buffer[i] == byteToSearch)
                return i;
        }
        return null;
    }
}
You could create a TextReader wrapper that tracks the current position in the base TextReader. Note that this position counts characters, not bytes:
public class TrackingTextReader : System.IO.TextReader
{
    private System.IO.TextReader _baseReader;
    private int _position;

    public TrackingTextReader(System.IO.TextReader baseReader)
    {
        _baseReader = baseReader;
    }

    public override int Read()
    {
        int c = _baseReader.Read();
        // Only count characters that were actually consumed; -1 means end of input
        if (c >= 0)
            _position++;
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    // Number of characters consumed so far
    public int Position
    {
        get { return _position; }
    }
}
You could then use it as follows:
string text = @"Foo
Bar
Baz
Bla
Fasel";
using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
    string line;
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", trackingReader.Position, line);
    }
}