Most frequent character in range

Question

I have a string s of length n. What is the most efficient data structure / algorithm to use for finding the most frequent character in range i..j?

The string doesn't change over time, I just need to repeat queries that ask for the most frequent char among s[i], s[i + 1], ... , s[j].

Luchian Grigore · Accepted Answer

An array in which you hold the number of occurences of each character. You increase the respective value while iterating throught the string once. While doing this, you can remember the current max in the array; alternitively, look for the highest value in the array at the end.

Pseudocode

arr = [0]
for ( char in string )
   arr[char]++
mostFrequent = highest(arr)

Pseudocode

arr = [0]
for ( char in string )
   arr[char]++
mostFrequent = highest(arr)

Ivaylo Strandjev · Answer

Do a single iteration over the array and for each position remember how many occurances of each character are there up to that position. So something like this:

"abcdabc"

for index 0:

count['a'] = 1
count['b'] = 0
etc...

for index 1:

....
count['a'] = 1
count['b'] = 1
count['c'] = 0
etc...

for index 2:

....
count['a'] = 1
count['b'] = 1
count['c'] = 1
....

And so on. For index 6:

....
count['a'] = 2
count['b'] = 2
count['c'] = 2
count['d'] = 1
... all others are 0

After you compute this array you can get the number of occurances of a given letter in an interval (i, j) in constant time - simply compute count[j] - count[i-1] (careful here for i = 0!).

So for each query you will have to iterate over all letters not over all characters in the interval and thus instead of iterating over 10^6 characters you will only pass over at most 128(assuming you only have ASCII symbols).

A drawback - you need more memory, depending on the size of the alphabet you are using.

ssegvic · Answer

If you wish to get efficient results on intervals, you can build an integral distribution vector at each index of your sequence. Then by subtracting integral distributions at j+1 and i you can obtain the distribution at the interval from s[i],s[i+1],...,s[j].

Some pseudocode in Python follows. I assume your characters are chars, hence 256 distribution entries.

def buildIntegralDistributions(s):
    IDs=[]        # integral distribution
    D=[0]*256
    IDs.append(D[:])
    for x in s:
        D[ord(x)]+=1
        IDs.append(D[:])
    return IDs

def getIntervalDistribution(IDs, i,j):
    D=[0]*256        
    for k in range(256):
        D[k]=IDs[j][k]-IDs[i][k]
    return D

s='abababbbb'
IDs=buildIntegralDistributions(s)
Dij=getIntervalDistribution(IDs, 2,4)

>>> s[2:4]
'ab'
>>> Dij[ord('a')]  # how many 'a'-s in s[2:4]?
1
>>> Dij[ord('b')]  # how many 'b'-s in s[2:4]?
1

TemplateRex · Answer

You need to specify your algorithmic requirements in terms of space and time complexity.

If you insist on O(1) space complexity, just sorting (e.g. using lexicographic ordering of bits if there is no natural comparison operator available) and counting the number of occurances of the highest element will give you O(N log N) time complexity.

If you insist on O(N) time complexity, use @Luchian Grigore 's solution which also takes O(N) space complexity (well, O(K) for K-letter alphabet).

Most frequent character in range

Tags:

c++

complexity-theory

character

mode

frequency

Momonga

4 Answers

Luchian Grigore

Ivaylo Strandjev

ssegvic

TemplateRex

Recent Activity

Donate For Us

Most frequent character in range

Tags:

c++

complexity-theory

character

mode

frequency

Momonga

4 Answers

Luchian Grigore

Ivaylo Strandjev

ssegvic

TemplateRex

Related questions

Recent Activity

Donate For Us