 

How can I efficiently index a file?

I am dealing with an application that needs to read whole lines of text, at random, from a series of potentially large text files (~3+ GB).

The lines can be of different lengths.

In order to reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store its position in a map in one pass, thereby producing an index of lineNo => position, i.e.:

// maps each line number to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
  1. Go through the entire file.
  2. Whenever a new line is detected, increment lineCount and add the current fileStream.Position to _lineNumberToFileStreamPositionMapping (a rough sketch of this pass follows below).
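
Roughly, that scanning pass looks like the following (a minimal sketch with illustrative names; it stores long offsets here, since byte positions in a 3+ GB file exceed the range of an int, and the 64 KB buffer size is an arbitrary choice):

List<long> BuildLineIndex(FileStream fileStream)
{
    var index = new List<long> { 0 };   // line 0 starts at offset 0
    var buffer = new byte[64 * 1024];
    long position = 0;

    int bytesRead;
    while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < bytesRead; i++)
        {
            position++;
            if (buffer[i] == '\n')
            {
                // The next line starts immediately after this newline
                // (a trailing newline at EOF adds one harmless extra entry).
                index.Add(position);
            }
        }
    }
    return index;
}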

We then use an API similar to:

public void ReadLine(int lineNumber)
{
     var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
     //... set the stream position, read the byte array, convert to string etc.
}
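
The elided body amounts to roughly the following (a sketch only: _fileStream is assumed to be the already-open FileStream, the text is assumed to be UTF-8, and the line is returned as a string here purely for illustration):

public string ReadLine(int lineNumber)
{
    var streamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
    _fileStream.Seek(streamPosition, SeekOrigin.Begin);

    // Read bytes up to the end of the line, then decode them into a string.
    var bytes = new List<byte>();
    int b;
    while ((b = _fileStream.ReadByte()) != -1 && b != '\n')
    {
        if (b != '\r')              // tolerate Windows line endings
            bytes.Add((byte)b);
    }
    return Encoding.UTF8.GetString(bytes.ToArray());
}

In practice the read would be buffered rather than byte-by-byte, but the shape is the same.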

This solution currently provides good performance; however, there are two things I do not like:

  1. Since I do not know the total number of lines in the file up front, I cannot preallocate an array, so I have to use a List<int>, which can end up resizing to double the capacity I actually need;
  2. Memory usage: for example, for a text file of ~1 GB with ~5 million lines of text, the index occupies ~150 MB. I would really like to decrease this as much as possible.

Any ideas are very much appreciated.

asked Apr 12 '16 by MaYaN

1 Answer

  1. Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.

  2. If you want to trade performance for memory, you can do this: instead of storing the position of every line, store only the position of every 100th (or so) line. Then when, say, line 253 is required, go to the stored position of line 200 and count forward 53 lines (a rough sketch follows below).
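
A rough sketch of that second idea (illustrative names; _fileStream is assumed to be the already-open FileStream and the text UTF-8; the sparse index is filled during the same scanning pass, recording an offset only every Interval-th line):

const int Interval = 100;
List<long> _sparseIndex = new List<long>();   // _sparseIndex[k] = offset where line (k * Interval) starts

public string ReadLine(int lineNumber)
{
    // Seek to the nearest indexed line at or before the requested one,
    // e.g. line 200 when line 253 is requested.
    _fileStream.Seek(_sparseIndex[lineNumber / Interval], SeekOrigin.Begin);

    // Skip forward over the remaining lines (53 in the example) by counting '\n' bytes.
    for (int skip = lineNumber % Interval; skip > 0; skip--)
    {
        int b;
        while ((b = _fileStream.ReadByte()) != -1 && b != '\n') { }
    }

    // The stream now sits at the start of the requested line; read and decode it.
    var bytes = new List<byte>();
    int c;
    while ((c = _fileStream.ReadByte()) != -1 && c != '\n')
    {
        if (c != '\r')
            bytes.Add((byte)c);
    }
    return Encoding.UTF8.GetString(bytes.ToArray());
}

This shrinks the index by roughly a factor of Interval, at the cost of scanning on average Interval / 2 extra lines per lookup.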

answered Oct 14 '22 by smead