I need to read a large space-seperated text file and count the number of instances of each code in the file. Essentially, these are the results of running some experiments hundreds of thousands of times. The system spits out a text file that looks kind of like this:
A7PS A8PN A6PP23 ...
And there are literally hundreds of thousands of these entries and I need to count the occurances of each of the codes.
I guess I could just open a StreamReader
and go through line by line, splitting on the space character. Seeing if the code has already been encountered and adding 1 to the count of that code. However, that is probably pretty naive, given the size of the data.
Anyone know of an efficient algorithm to handle this sort of processing?
UPDATE :
OK, so the consensus seems to be my approach is along the right lines
What I'd be interested to hear are things like - which is more efficient - StreamReader. TextReader, BinaryReader
What is the best structure to store my dictionary of results? HashTable, SortedList, HybridDictionary
If there are no line breaks ion the file (I haven't been given a sample yet) will just splitting the whole thing on a space be inefficient?
Essentially, I am looking at making it as performant as possible
thanks again
Your approach looks fine.
I would say that in general your approach is right but there is scope for parallelism. I would suggest that you start multiple threads or tasks (in .NET 4) each parsing part/chunk of file. Also instead of reading line by line, read in chunk of bytes - will give better performance from disk IO perspective.
Edit: Here's the outline of solution.
Of course, I am assuming really huge files for taking this approach. I will probably use old style character lookup in buffer to find word boundary marking lookup code as unsafe to avoid bound checks.
I agree with the comment by PoweRoy: why not try it out? Maybe there is no problem in practice.
If you do need something else, you could try writing some code that takes a Stream
and returns an IEnumerable<string>
. It would read characters from its input one at a time - if you need buffering for efficiency you can always wrap the FileStream
you are actually giving this code in a BufferStream
- and checks if it's a space (or possible an EOL?). If it isn't, it will add the character to a string buffer (perhaps a StringBuilder
?), but if it is it will yield return
the current string buffer and clear it.
After that you can just foreach
over the result of calling this code on the content of the file and you will get the codes from the file one by one.
You could then use some kind of data structure like a Dictionary<string,int>
to count the number of occurrances for each code, keeping the code as key, and the count as value. But this step would be same if you read the file line by line and use string.Split
to split them on spaces.
If you want to try something different, you could trying using a BinaryReader
, and read the stream byte by byte, and increase a counter by one everytime you encounter a space.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With