Data structure for indexing a big file

Tags: c#, algorithm, complexity-theory, list, data-structures

I need to build an index for a very large (50 GB+) ASCII text file that will give me fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use List<List<long>> map, where map[i][j] is the position of the jth word of the ith line in the file.

I will build the index sequentially, i.e. read the whole file and populate the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word's position with map[i][j].

The only problem I see is that I can't predict the total number of lines/words, so I will hit an O(n) copy on every List reallocation, and I have no idea how to avoid this.

Are there any other problems with the data structure I chose for the task? Which structure could be better?

UPD: The file will not be altered at runtime. There are no other ways to retrieve content except the ones I've listed.

asked Mar 17 '13 07:03

vorou

1 Answer

1. Growing a large list is a very expensive operation, because every reallocation copies the whole backing array; so it's better to reserve the list's capacity at the beginning.
2. I'd suggest using two lists. The first contains the positions of the words within the file, and the second contains indexes into the first list (the index of the first word of each line).
3. You are very likely to exceed all available RAM. Once the system starts paging GC-managed memory in and out, the performance of the program will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
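
The two-list layout from the second point could look roughly like the sketch below. FlatWordIndex, Build, WordCount and GetWordOffset are names made up for this illustration, and the code assumes that a word is any run of bytes other than space, tab, CR and LF. It also assumes the word count fits in a List<long>, which may well not hold for a 50 GB file; that is what the third point is about.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: a flat index built from two lists instead of List<List<long>>.
public sealed class FlatWordIndex
{
    // Byte offset of every word in the file, in file order.
    private readonly List<long> wordOffsets;
    // lineStarts[i] = index in wordOffsets of the first word of line i.
    private readonly List<int> lineStarts;

    private FlatWordIndex(int expectedWords, int expectedLines)
    {
        // Reserving capacity up front avoids repeated reallocations (first point).
        wordOffsets = new List<long>(expectedWords);
        lineStarts = new List<int>(expectedLines);
    }

    // A trailing '\n' at the end of the input registers one final empty line.
    public int LineCount { get { return lineStarts.Count; } }

    public static FlatWordIndex Build(Stream input, int expectedWords = 0, int expectedLines = 0)
    {
        var index = new FlatWordIndex(expectedWords, expectedLines);
        index.lineStarts.Add(0);               // line 0 starts at word 0
        long position = 0;
        bool inWord = false;
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                inWord = false;
                // The next line starts at whatever word comes next.
                index.lineStarts.Add(index.wordOffsets.Count);
            }
            else if (b == ' ' || b == '\t' || b == '\r')
            {
                inWord = false;
            }
            else if (!inWord)
            {
                inWord = true;                 // first byte of a new word
                index.wordOffsets.Add(position);
            }
            position++;
        }
        return index;
    }

    // Number of words on the given line.
    public int WordCount(int line)
    {
        int end = line + 1 < lineStarts.Count ? lineStarts[line + 1] : wordOffsets.Count;
        return end - lineStarts[line];
    }

    // Byte offset in the file of the given word on the given line.
    public long GetWordOffset(int line, int word)
    {
        if (word < 0 || word >= WordCount(line))
            throw new ArgumentOutOfRangeException(nameof(word));
        return wordOffsets[lineStarts[line] + word];
    }
}
```

Compared with List<List<long>>, this keeps only two large arrays alive instead of one small list object per line, and a lookup is still two array reads.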

UPD: Memory-mapped files are effective when you need to work with huge amounts of data that do not fit in RAM. Basically, it's your only choice if your index becomes bigger than the available RAM.
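
As a rough sketch of that idea: the index can be written sequentially to a binary file of fixed-width 8-byte entries (so the total count does not need to be known in advance) and then read back through a memory-mapped view. MappedLongArray and its Write helper are names invented for this example. The sketch assumes a 64-bit process, a non-empty index file, and a little-endian platform, so that the bytes written by BinaryWriter match what ReadInt64 expects.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical helper: a read-only array of longs backed by a memory-mapped file.
public sealed class MappedLongArray : IDisposable
{
    private readonly MemoryMappedFile file;
    private readonly MemoryMappedViewAccessor view;

    // Number of 8-byte entries, taken from the file size
    // (the view itself may be rounded up to a page boundary).
    public long Length { get; }

    public MappedLongArray(string path)
    {
        Length = new FileInfo(path).Length / sizeof(long);
        file = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        view = file.CreateViewAccessor();      // maps the whole file
    }

    // Entry i lives at byte offset i * 8; the OS pages it in on demand.
    public long this[long i]
    {
        get
        {
            if (i < 0 || i >= Length) throw new ArgumentOutOfRangeException(nameof(i));
            return view.ReadInt64(i * sizeof(long));
        }
    }

    // Appends the values to a new file one by one; nothing is held in managed memory.
    public static void Write(string path, IEnumerable<long> values)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            foreach (long value in values) writer.Write(value);
        }
    }

    public void Dispose()
    {
        view.Dispose();
        file.Dispose();
    }
}
```

With two such files, one for word offsets and one for line starts, the word lookup becomes wordOffsets[lineStarts[i] + j], and the index stays outside the GC heap.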

answered Oct 21 '22 15:10

fithu
