I need to build an index for a very big (50GB+) ASCII text file that will give me fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the element map[i][j] is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. read the whole file and populate the index with map.Add(new List<long>()) for each new line and map[i].Add(position) for each new word. I will then retrieve a specific word position with map[i][j].
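The sequential build described above could look roughly like the following. This is a minimal sketch, not a tuned implementation: the class name WordIndex and the method Build are hypothetical, and it assumes one byte per character (ASCII), '\n' as the line terminator, and spaces, tabs, or '\r' as word separators.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WordIndex
{
    // Scans the stream once and records the byte offset of the first
    // character of every word, grouped by line.
    // Assumptions: ASCII input, '\n' ends a line, words are separated
    // by ' ', '\t' or '\r'.
    public static List<List<long>> Build(Stream input)
    {
        var map = new List<List<long>> { new List<long>() }; // line 0
        var buffer = new byte[1 << 16];
        long position = 0;      // absolute byte offset in the file
        bool inWord = false;
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int k = 0; k < read; k++, position++)
            {
                byte b = buffer[k];
                if (b == (byte)'\n')
                {
                    map.Add(new List<long>());         // new line
                    inWord = false;
                }
                else if (b == (byte)' ' || b == (byte)'\t' || b == (byte)'\r')
                {
                    inWord = false;                    // separator
                }
                else if (!inWord)
                {
                    map[map.Count - 1].Add(position);  // new word
                    inWord = true;
                }
            }
        }
        return map;
    }
}
```

Note that a file ending in '\n' produces one extra, empty entry for the line after the last terminator.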
The only problem I see is that I can't predict the total number of lines or words, so I will hit an O(n) copy on every List reallocation, and I have no idea how to avoid this.
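One way to gauge how much the reallocations matter is to count them. List<T> enlarges its backing array in larger steps rather than one element at a time, so only some calls to Add trigger a copy, and passing an initial capacity to the constructor avoids the copies entirely when a usable estimate exists. The helper below is a hypothetical probe (the names GrowthProbe and CountReallocations are made up for illustration) that detects each growth step by watching the Capacity property.

```csharp
using System;
using System.Collections.Generic;

public static class GrowthProbe
{
    // Appends `count` items and counts how often the list's Capacity
    // changed, i.e. how often the backing array was reallocated and copied.
    public static int CountReallocations(int count, int initialCapacity = 0)
    {
        var list = new List<long>(initialCapacity);
        int reallocations = 0;
        int capacity = list.Capacity;
        for (long i = 0; i < count; i++)
        {
            list.Add(i);
            if (list.Capacity != capacity)
            {
                capacity = list.Capacity;
                reallocations++;
            }
        }
        return reallocations;
    }
}
```

Calling GrowthProbe.CountReallocations(1000000) and comparing it with GrowthProbe.CountReallocations(1000000, 1000000) shows the difference between an unsized and a presized list. For the index, a rough capacity could be derived from the file size and an assumed average line length, which is only a guess about the data.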
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: File will not be altered during the runtime. There are no other ways to retrieve content except what I've listed.
Data structures for indexing
B-trees are the most commonly used data structures for indexes as they are time-efficient for lookups, deletions, and insertions. All these operations can be done in logarithmic time. Data that is stored inside of a B-tree can be sorted.
The idea of Big Data indexing is to fragment the datasets according to criteria that will be used frequently in queries [14]. The fragments are indexed, with each containing values that satisfy some query predicates. This is aimed at storing the data in a more organized manner, thereby easing information retrieval.
Columns with one or more of the following characteristics are good candidates for indexing: Values are unique in the column, or there are few duplicates. There is a wide range of values (good for regular indexes). There is a small range of values (good for bitmap indexes).
Simply stated, indexing is a data structure technique that collects, parses, and stores data to enhance the speed and performance of retrieving and analyzing relevant documents. Indexes are used to quickly locate data without having to search every row in a database table every time a table is accessed.
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that do not fit in RAM. Basically, they are your only choice if your index becomes bigger than the available RAM.
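As a rough illustration of the memory-mapped approach, the index can be stored on disk as a flat array of 64-bit values and read through System.IO.MemoryMappedFiles. The file layout and the names MappedWordIndex, Save, and GetWordPosition are assumptions made for this sketch: the file holds the line count, then lineCount + 1 cumulative word counts, then all word positions. Save takes an in-memory map only to keep the example short; for an index larger than RAM the writer would have to stream its output while scanning the text file. The sketch also assumes a little-endian platform, since BinaryWriter writes little-endian values and the view accessor reads them in native byte order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

public sealed class MappedWordIndex : IDisposable
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;
    private readonly long _lineCount;

    // Assumed layout (all values are 8-byte integers):
    //   [lineCount][cumulative word counts: lineCount + 1][word positions]
    public static void Save(string path, List<List<long>> map)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write((long)map.Count);
            long total = 0;
            writer.Write(total);                 // cumulative count before line 0
            foreach (var line in map)
            {
                total += line.Count;
                writer.Write(total);             // cumulative count after this line
            }
            foreach (var line in map)
                foreach (long position in line)
                    writer.Write(position);
        }
    }

    public MappedWordIndex(string path)
    {
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        _view = _file.CreateViewAccessor();
        _lineCount = _view.ReadInt64(0);
    }

    // Returns the file position of word `word` in line `line` (both zero-based).
    public long GetWordPosition(long line, long word)
    {
        if (line < 0 || line >= _lineCount)
            throw new ArgumentOutOfRangeException(nameof(line));
        long first = _view.ReadInt64(8 * (1 + line));   // words before this line
        long end = _view.ReadInt64(8 * (2 + line));     // words up to and including it
        if (word < 0 || word >= end - first)
            throw new ArgumentOutOfRangeException(nameof(word));
        long positionsStart = 8 * (2 + _lineCount);     // byte offset of the positions block
        return _view.ReadInt64(positionsStart + 8 * (first + word));
    }

    public void Dispose()
    {
        _view.Dispose();
        _file.Dispose();
    }
}
```

With this layout a lookup costs three reads from the mapped view, and the operating system decides which pages of the index stay in memory.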