Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a better way to calculate the frequency of all symbols in a file?

Okay, so, say I have a text file (not necessarily containing every possible symbol) and I'd like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access each symbol and its frequency from most frequent to least frequent. The symbols are not necessarily ASCII characters, they could be arbitrary byte sequences, albeit all of the same length.

I was considering doing something like this (in pseudocode):

function add_to_heap (symbol)
    freq = heap.find(symbol).frequency
    if (freq.exists? == true)
        freq++
    else
        symbol.freq = 1
        heap.insert(symbol)

MaxBinaryHeap heap
while somefile != EOF
    symbol = read_byte(somefile)
    heap.add_to_heap(symbol)
heap.sort_by_frequency()

while heap.root != empty
    root = heap.extract_root()
    do_stuff(root)

I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?

like image 921
rps Avatar asked Oct 04 '11 19:10

rps


2 Answers

You can always use a HashMap isntead of the Heap. Like this you'll be performing operations that are in O(1) for each symbol found instead of O(log n) wheres n is the number of items currently on the heap.

However, if te number of distinct symbols is bounded by a reasonable number (1 Byte is ideal, 2 Byte should be still fine), you can just use an array of that size and again have O(1) but with a significantly lower constant cost.

like image 79
b.buchhold Avatar answered Nov 15 '22 08:11

b.buchhold


If you're looking for a "best" solution based on running times, here's what I'd suggest:

When you're reading the file, you should have your symbols sorted (or hashed) by the value of the symbols themselves, not their frequencies. This'll let you find the current symbol in your list of already seen symbols quickly, rather than having to search through your entire list. You should also have that initial structure be able to perform fast inserts - I'd recommend a binary tree of a hash.

Once you've read all your symbols, then you should switch your ordering to be based on the frequency counts. I'd read everything into an array and then perform an in-place sort, but there are a bunch of equivalent ways to do this.

Hope this helps!

like image 32
Xavier Holt Avatar answered Nov 15 '22 08:11

Xavier Holt