What is the fastest way to calculate a frequency distribution for an array in C#?

I am just wondering what the best approach is for that calculation. Let's assume I have an input array of values and an array of boundaries, and I want to calculate/bucketize the frequency distribution for each segment defined by the boundaries array.

Is it a good idea to use bucket search for that?

Actually, I found this question: Calculating frequency distribution of a collection with .Net/C#

But I do not understand how to use buckets for that purpose, because the size of each bucket can be different in my situation.

EDIT: After all the discussion I have an inner/outer loop solution, but I still want to eliminate the inner loop, perhaps with a Dictionary, to get O(n) performance. If I understood correctly, I need to hash the input values into a bucket index, so we need some sort of hash function with O(1) complexity. Any ideas how to do that?
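For example, if the buckets all had equal width, the bucket index could be computed directly, with no search at all. A minimal sketch of that idea (min and width are hypothetical names here, and my real boundaries are not equal-width):

// Sketch only: assumes equal-width buckets.
// min is the lower bound of the first bucket, width the uniform bucket size.
int GetBucketIndex(double value, double min, double width, int bucketCount)
{
    if (value < min) return -1;               // below the first boundary
    int index = (int)((value - min) / width); // O(1) arithmetic "hash"
    return index < bucketCount ? index : -1;  // -1 when above the last boundary
}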

asked Aug 31 '11 at 15:08 by Andrey



2 Answers

Bucket Sort is already O(n^2) worst case, so I would just do a simple inner/outer loop here. Since your bucket array is necessarily shorter than your input array, keep it on the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.

int[] freq = new int[buckets.Length - 1]; // one counter per bucket
foreach (int d in input)
{
    for (int i = 0; i < buckets.Length - 1; i++)
    {
        // bucket i covers the half-open range [buckets[i], buckets[i+1])
        if (d >= buckets[i] && d < buckets[i + 1])
        {
            freq[i]++;
            break;
        }
    }
}

It's also O(n^2) worst case but you can't beat the code simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger bucket array, you could use a binary search of some sort. But, since frequency distributions are typically < 100 elements, I doubt you'd see a lot of real-world performance benefit.
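A minimal sketch of that binary-search variant, assuming buckets is the same sorted boundaries array as above: Array.BinarySearch returns the bitwise complement of the insertion point when the value is not found, which maps directly onto a bucket index.

int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    int pos = Array.BinarySearch(buckets, d);
    // An exact hit on a boundary starts that bucket; otherwise ~pos is the
    // index of the first boundary greater than d, so d falls in bucket ~pos - 1.
    int bucket = pos >= 0 ? pos : ~pos - 1;
    if (bucket >= 0 && bucket < freq.Length)
        freq[bucket]++;
}

This drops the per-element cost from O(m) to O(log m), where m is the number of buckets.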

answered Sep 19 '22 at 15:09 by drharris


If your input array represents real-world data (with its patterns) and the array of boundaries is too large to iterate again and again in an inner loop, you can consider the following approach:

  • First of all, sort your input array. If you work with real-world data, I would recommend considering Timsort (Wiki) for this. It provides very good performance guarantees for the patterns that can be seen in real-world data.

  • Traverse the sorted array and compare each value with the current boundary in the array of boundaries:

    • If the value in the input array is less than the boundary, increment the frequency counter for this boundary.
    • If the value in the input array is greater than or equal to the boundary, move on to the next boundary and increment the counter for the new boundary.

In code it can look like this:

Timsort(myArray); // Timsort assumed available; .NET's built-in Array.Sort would also work
int boundPos = 0;
int[] boundaries = GetBoundaries(); // assume a sorted array of upper bucket boundaries
int[] freq = new int[boundaries.Length]; // one frequency counter per boundary

for (int i = 0; i < myArray.Length; i++) {
  // advance past every boundary the current (sorted) value has already crossed
  while (boundPos < boundaries.Length - 1 && myArray[i] >= boundaries[boundPos]) {
    boundPos++;
  }
  freq[boundPos]++;
}
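
With the input sorted, this sweep touches each element once and advances through the boundaries at most once overall, so after the O(n log n) sort the counting pass itself is O(n + m).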

answered Sep 20 '22 at 15:09 by Andrey Taptunov