I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).
The lines can be of a different length.
In order to reduce GC
and create unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store that in a map in one pass therefore producing an index of lineNo => position
i.e:
// maps each line to it's corresponding fileStream.position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
new line
increment lineCount
and add the fileStream.Position
to the _lineNumberToFileStreamPositionMapping
We then use an API similar to:
public void ReadLine(int lineNumber)
{
var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
//... set the stream position, read the byte array, convert to string etc.
}
This solution is currently providing a good performance however there are two things I do not like:
array
therefore I have to use a List<int>
which has the potential inefficiency of resizing to double of what I actually need;Any ideas are very much appreciated.
If you're manually indexing, best practice is called the double key method. In the double key method, two separate people both label each scanned document with all the necessary indexing terms, typing the information they see into appropriate metadata fields for the file.
Indexing is a data structure technique to efficiently retrieve records from the database files based on some attributes on which the indexing has been done. Indexing in database systems is similar to what we see in books. Indexing is defined based on its indexing attributes.
Indexing makes columns faster to query by creating pointers to where data is stored within a database. Imagine you want to find a piece of information that is within a large database. To get this information out of the database the computer will look through every row until it finds it.
Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.
If you want to trade performance for memory, you can do this: instead of storing the positions of every line, store only the positions of every 100th (or something) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With