Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How much every character occurs in the given string

Tags:

c++

c

high-load

I need to calculate how many times every character occurs in the given string. I need to do it on C or C++, I can use any library. The problem is that I am not a C/C++ developer, so I am not sure that my code is optimal. I want to get the best performance algorithm, it is the main reason for this question.

I am using the following code at the moment:

using namespace std;
...

char* text;        // some text, may be very long
int text_length;   // I know this value, if it can help

map<char,int> table;
map<char,int>::iterator it;

for(int i = 0; c = text[i]; i++) {
    it = table.find(c);
    if (it2 == table.end()) {
        table[c] = 1;
    } else {
        table[c]++;
    }
}

I may use any other structure except std::map, but I do not know which structure is better.

Thanks for your help!

like image 539
Dmitrii Tarasov Avatar asked Jan 26 '26 06:01

Dmitrii Tarasov


2 Answers

You are doing it right using bucket sort. There cannot be a faster (non-parallel) algorithm for counting elements in a finite universe (such as characters).

If you only use ASCII characters, you can use a simple array int table[256] to avoid the overhead of C++ containers.

Using Duff's device (which is actually slower on some CPUs nowadays):

int table[256];
memset(table, 0, sizeof(table));
int iterations = (text_length+7) / 8;
switch(count % 8){
    case 0:      do {    table[ *(text++) ]++;
    case 7:              table[ *(text++) ]++;
    case 6:              table[ *(text++) ]++;
    case 5:              table[ *(text++) ]++;
    case 4:              table[ *(text++) ]++;
    case 3:              table[ *(text++) ]++;
    case 2:              table[ *(text++) ]++;
    case 1:              table[ *(text++) ]++;
                 } while(--iterations > 0);
}

Update: As MRAB remarked, processing text chunks in parallel might give you a perfomance boost. But be aware that creating a thread is quite expensive, so you should measure, what the lowest amount of characters is, which justifies the thread creation time.

like image 115
kay Avatar answered Jan 27 '26 22:01

kay


You could make an array of 256 ints. one for each character.

Initialize them all to 0, then for each character you see increase the cell in the table with that ascii value.

like image 45
Yochai Timmer Avatar answered Jan 27 '26 21:01

Yochai Timmer