
Efficiently parsing a large text file in C#

I need to read a large space-separated text file and count the number of instances of each code in the file. Essentially, these are the results of running some experiments hundreds of thousands of times. The system spits out a text file that looks kind of like this:

A7PS A8PN A6PP23 ...

And there are literally hundreds of thousands of these entries and I need to count the occurrences of each of the codes.

I guess I could just open a StreamReader and go through it line by line, splitting on the space character, checking whether the code has already been encountered, and adding 1 to that code's count. However, that is probably pretty naive, given the size of the data.

Anyone know of an efficient algorithm to handle this sort of processing?

UPDATE:

OK, so the consensus seems to be that my approach is along the right lines.

What I'd be interested to hear is things like - which is more efficient: StreamReader, TextReader, or BinaryReader?

What is the best structure to store my dictionary of results: Hashtable, SortedList, or HybridDictionary?

If there are no line breaks in the file (I haven't been given a sample yet), will just splitting the whole thing on a space be inefficient?

Essentially, I am looking to make it as performant as possible.

thanks again

asked Aug 27 '10 by ChrisCa

4 Answers

Your approach looks fine.

  1. Read in line per line
  2. Split each line by space
  3. Add a record to a dictionary if it doesn't exist yet; if it does exist, increment its value
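The three steps above can be sketched roughly like this (the file path and sample contents are invented for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class CodeCounter
{
    // Count occurrences of each space-separated code in a text file.
    public static Dictionary<string, int> CountCodes(string path)
    {
        var counts = new Dictionary<string, int>();
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var codes = line.Split(new[] { ' ' },
                                       StringSplitOptions.RemoveEmptyEntries);
                foreach (var code in codes)
                {
                    // TryGetValue leaves n at 0 for unseen codes,
                    // avoiding a separate ContainsKey lookup.
                    counts.TryGetValue(code, out int n);
                    counts[code] = n + 1;
                }
            }
        }
        return counts;
    }

    static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllText(path, "A7PS A8PN A6PP23 A7PS\nA8PN A7PS");
        Console.WriteLine(CountCodes(path)["A7PS"]); // 3
    }
}
```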
answered Sep 18 '22 by Carra


I would say that in general your approach is right, but there is scope for parallelism. I would suggest that you start multiple threads or tasks (in .NET 4), each parsing a part/chunk of the file. Also, instead of reading line by line, read in chunks of bytes - this will give better performance from a disk I/O perspective.

Edit: Here's the outline of solution.

  1. Let's say we will process M chunks of N characters at a time (because we want to limit the amount of memory needed and the number of threads used).
  2. Allocate an N*M character buffer. We will use this buffer cyclically.
  3. Use the producer-consumer pattern. The producer fills the buffer, trying to find a word boundary near each chunk boundary (i.e. near every Nth character). So we will have M chunks of approximately N characters, each with a start and end index within the buffer.
  4. Now launch M worker threads to process the chunks. Each worker uses its own dictionary to count words - this eliminates the need for thread synchronization.
  5. Aggregate the results at the end of each iteration. The process is repeated until the entire file is read.
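A simplified sketch of the core idea - snapping chunk boundaries to word boundaries and giving each worker its own dictionary - might look like this. For brevity it takes the text as a string rather than cycling a fixed buffer, so it is an illustration of steps 3-5 only, not the full producer-consumer pipeline:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelCounter
{
    public static Dictionary<string, int> Count(string text, int chunkCount)
    {
        // Compute chunk boundaries, snapping each one forward to the
        // next space so no code straddles two chunks.
        var bounds = new List<int> { 0 };
        for (int i = 1; i < chunkCount; i++)
        {
            int pos = text.Length * i / chunkCount;
            while (pos < text.Length && text[pos] != ' ') pos++;
            bounds.Add(pos);
        }
        bounds.Add(text.Length);

        // Each worker counts into its own dictionary: no locking needed.
        var partials = new ConcurrentBag<Dictionary<string, int>>();
        Parallel.For(0, bounds.Count - 1, i =>
        {
            var local = new Dictionary<string, int>();
            var slice = text.Substring(bounds[i], bounds[i + 1] - bounds[i]);
            foreach (var code in slice.Split(new[] { ' ', '\r', '\n' },
                                             StringSplitOptions.RemoveEmptyEntries))
            {
                local.TryGetValue(code, out int n);
                local[code] = n + 1;
            }
            partials.Add(local);
        });

        // Aggregate the per-worker results.
        var total = new Dictionary<string, int>();
        foreach (var local in partials)
            foreach (var kv in local)
            {
                total.TryGetValue(kv.Key, out int n);
                total[kv.Key] = n + kv.Value;
            }
        return total;
    }

    static void Main()
    {
        var counts = Count("A7PS A8PN A7PS A6PP23 A7PS", 3);
        Console.WriteLine(counts["A7PS"]); // 3
    }
}
```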

Of course, I am assuming really huge files to justify this approach. I would probably use an old-style character lookup in the buffer to find word boundaries, marking the lookup code as unsafe to avoid bounds checks.

answered Sep 20 '22 by VinayC


I agree with the comment by PoweRoy: why not try it out? Maybe there is no problem in practice.

If you do need something else, you could try writing some code that takes a Stream and returns an IEnumerable<string>. It would read characters from its input one at a time - if you need buffering for efficiency, you can always wrap the FileStream you are actually giving this code in a BufferedStream - and check whether each is a space (or possibly an EOL?). If it isn't, it appends the character to a string buffer (perhaps a StringBuilder?); if it is, it yield returns the current string buffer and clears it.

After that you can just foreach over the result of calling this code on the content of the file and you will get the codes from the file one by one.
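A sketch of such an enumerator (the `Tokenizer` name is invented; it takes a TextReader for simplicity, but the same logic applies to a Stream wrapped in a BufferedStream and StreamReader):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class Tokenizer
{
    // Lazily yield each space- or newline-delimited code from the input.
    public static IEnumerable<string> ReadCodes(TextReader reader)
    {
        var buffer = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == ' ' || c == '\r' || c == '\n')
            {
                // Hit a delimiter: emit the buffered code, if any.
                if (buffer.Length > 0)
                {
                    yield return buffer.ToString();
                    buffer.Clear();
                }
            }
            else
            {
                buffer.Append((char)c);
            }
        }
        // Emit the final code if the input didn't end with a delimiter.
        if (buffer.Length > 0) yield return buffer.ToString();
    }

    static void Main()
    {
        var reader = new StringReader("A7PS A8PN A6PP23");
        foreach (var code in ReadCodes(reader))
            Console.WriteLine(code);
    }
}
```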

You could then use some kind of data structure like a Dictionary<string,int> to count the number of occurrences of each code, keeping the code as the key and the count as the value. But this step would be the same if you read the file line by line and used string.Split to split on spaces.

answered Sep 22 '22 by peSHIr


If you want to try something different, you could try using a BinaryReader, reading the stream byte by byte, and increasing a counter by one every time you encounter a space.
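A minimal sketch of that idea - note this counts delimiters rather than distinct codes, so it only tells you roughly how many codes there are (spaces plus one, assuming single spaces), not the per-code tallies the question asks for:

```csharp
using System;
using System.IO;
using System.Text;

class SpaceScan
{
    // Scan a seekable stream byte by byte, counting space bytes.
    public static long CountSpaces(Stream input)
    {
        long spaces = 0;
        using (var reader = new BinaryReader(input))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
                if (reader.ReadByte() == (byte)' ') spaces++;
        }
        return spaces;
    }

    static void Main()
    {
        var bytes = Encoding.ASCII.GetBytes("A7PS A8PN A6PP23");
        Console.WriteLine(CountSpaces(new MemoryStream(bytes))); // 2
    }
}
```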

answered Sep 21 '22 by Vivelin