 

How can I efficiently index a file?

I am dealing with an application that needs to read whole lines of text, at random, from a series of potentially large text files (~3+ GB).

The lines can be of different lengths.

In order to reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store its position in a map in one pass, thereby producing an index of lineNo => position, i.e.:

// maps each line number to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
  1. Go through the entire file.
  2. Whenever a new line is detected, increment lineCount and add the current fileStream.Position to _lineNumberToFileStreamPositionMapping (a rough sketch of this pass follows below).
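
Roughly, that scanning pass looks like the following (a minimal sketch with illustrative names; it stores long offsets here, since byte positions in a 3+ GB file exceed the range of an int, and the 64 KB buffer size is an arbitrary choice):

List<long> BuildLineIndex(FileStream fileStream)
{
    var index = new List<long> { 0 };   // line 0 starts at offset 0
    var buffer = new byte[64 * 1024];
    long position = 0;

    int bytesRead;
    while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < bytesRead; i++)
        {
            position++;
            if (buffer[i] == '\n')
            {
                // The next line starts immediately after this newline
                // (a trailing newline at EOF adds one harmless extra entry).
                index.Add(position);
            }
        }
    }
    return index;
}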

We then use an API similar to:

public void ReadLine(int lineNumber)
{
     var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
     //... set the stream position, read the byte array, convert to string etc.
}
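
The elided body amounts to roughly the following (a sketch only: _fileStream is assumed to be the already-open FileStream, the text is assumed to be UTF-8, and the line is returned as a string here purely for illustration):

public string ReadLine(int lineNumber)
{
    var streamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
    _fileStream.Seek(streamPosition, SeekOrigin.Begin);

    // Read bytes up to the end of the line, then decode them into a string.
    var bytes = new List<byte>();
    int b;
    while ((b = _fileStream.ReadByte()) != -1 && b != '\n')
    {
        if (b != '\r')              // tolerate Windows line endings
            bytes.Add((byte)b);
    }
    return Encoding.UTF8.GetString(bytes.ToArray());
}

In practice the read would be buffered rather than byte-by-byte, but the shape is the same.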

This solution currently provides good performance; however, there are two things I do not like:

  1. Since I do not know the total number of lines in the file up front, I cannot preallocate an array, so I have to use a List<int>, which can end up resizing to double the capacity I actually need;
  2. Memory usage: for example, for a text file of ~1 GB with ~5 million lines of text, the index occupies ~150 MB. I would really like to decrease this as much as possible.

Any ideas are very much appreciated.

asked Apr 12 '16 by MaYaN

1 Answer

  1. Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.

  2. If you want to trade performance for memory, you can do this: instead of storing the position of every line, store only the position of every 100th (or so) line. Then when, say, line 253 is required, go to the stored position of line 200 and count forward 53 lines (a rough sketch follows below).
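
A rough sketch of that second idea (illustrative names; _fileStream is assumed to be the already-open FileStream and the text UTF-8; the sparse index is filled during the same scanning pass, recording an offset only every Interval-th line):

const int Interval = 100;
List<long> _sparseIndex = new List<long>();   // _sparseIndex[k] = offset where line (k * Interval) starts

public string ReadLine(int lineNumber)
{
    // Seek to the nearest indexed line at or before the requested one,
    // e.g. line 200 when line 253 is requested.
    _fileStream.Seek(_sparseIndex[lineNumber / Interval], SeekOrigin.Begin);

    // Skip forward over the remaining lines (53 in the example) by counting '\n' bytes.
    for (int skip = lineNumber % Interval; skip > 0; skip--)
    {
        int b;
        while ((b = _fileStream.ReadByte()) != -1 && b != '\n') { }
    }

    // The stream now sits at the start of the requested line; read and decode it.
    var bytes = new List<byte>();
    int c;
    while ((c = _fileStream.ReadByte()) != -1 && c != '\n')
    {
        if (c != '\r')
            bytes.Add((byte)c);
    }
    return Encoding.UTF8.GetString(bytes.ToArray());
}

This shrinks the index by roughly a factor of Interval, at the cost of scanning on average Interval / 2 extra lines per lookup.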

answered Oct 14 '22 by smead