Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data structure for indexing big file

I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to file (get nth line, get nth word in nth line). I've decided to use List<List<long>> map, where map[i][j] element is position of jth word of ith line in the file.

I will build the index sequentially, i.e. read the whole file and populating index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve specific word position with map[i][j].

The only problem I see is that I can't predict total count of lines/words, so I will bump into O(n) on every List reallocation, no idea of how I can avoid this.

Are there any other problems with the data structure I chose for the task? Which structure could be better?

UPD: File will not be altered during the runtime. There are no other ways to retrieve content except what I've listed.

like image 496
vorou Avatar asked Mar 17 '13 07:03

vorou


People also ask

What data structure does indexing use?

Data structures for indexingB-trees are the most commonly used data structures for indexes as they are time-efficient for lookups, deletions, and insertions. All these operations can be done in logarithmic time. Data that is stored inside of a B-tree can be sorted.

Can indexes be created in big data?

The idea of Big Data indexing is to fragment the datasets according to criteria that will be used frequently in query[14]. The fragments are indexed with each containing value satisfying some query predicates. This is aimed at storing the data in a more organized manner, thereby easing information retrieval.

Which column is good for indexing?

Columns with one or more of the following characteristics are good candidates for indexing: Values are unique in the column, or there are few duplicates. There is a wide range of values (good for regular indexes). There is a small range of values (good for bitmap indexes).

What is index in big data?

Simply stated, indexing is a data structure technique that collects, parses, and stores data to enhance the speed and performance of retrieving and analyzing relevant documents. Indexes are used to quickly locate data without having to search every row in a database table every time a table is accessed.


1 Answers

  1. Increasing size of a large list is very expensive operation; so, it's better to reserve list size at the beginning.
  2. I'd suggest to use 2 lists. The first contains indexes of words within file, and the second contains indexes in the first list (index of the first word in the appropriate line).
  3. You are very likely to exceed all available RAM. And when the system starts to page in/page out GC-managed RAM, performance of the program will be completely killed. I'd suggest to store your data in memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx

UPD memory mapped files are effective, when you need to work with huge amounts of data not fitting in RAM. Basically, it's your the only choice if your index becomes bigger than available RAM.

like image 81
fithu Avatar answered Oct 21 '22 15:10

fithu