Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The n-gram that is the most frequent one among all the words

I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length Your task is to

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram

Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds

Example Input text: “aaaab a0a baaab c”

Output aaa ngramLength: 3

Explanation

For the input presented above the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1

If I have only one hour to solve the problem and I chose to use the C language to solve it: is it a good idea to implement a Hash Table to count the frequency of the N-grams with that amount of time? because in the C library there is no implementation of a Hash Table...

If yes, I was thinking to implement a Hash Table using separate chaining with ordered linked lists. Those implementations reduce the time that you have to solve the problem....

Is that the fastest option possible?

Thank you!!!

like image 652
andrestoga Avatar asked Sep 04 '14 00:09

andrestoga


2 Answers

If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.

This should execute fast enough, and there is no need to code any fancy data structures.

like image 171
btilly Avatar answered Nov 15 '22 18:11

btilly


So the basic recipe for this problem would be:

  1. Find all n-grams in string
  2. Map all duplicate entries into a new structure that has the n-gram and the number of times it occurs

You can find my c++ solution here: http://ideone.com/MNFSis

Given:

const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };

Then, for the first step - this is the code:

const char *ptr = str;
int idx = 0;
//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) { 
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}

At this point, we have a list of all n-grams. Lets find the most popular one:

FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list

for (int i=0; i<NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;
    //insertFreqNode takes a start node, this where we will start to search for duplicates
    //the simplest description is like this:
    //  1 we search from head down each child, if we find a node that has text equal to
    //    ngrams[i] then we update it's frequency count
    //  2 if the freq is >= to the current winner we place this as head.next
    //  3 after program is complete, our most popular nodes will be the first nodes
    //    I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}

//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " " << " with " << head.next->freq << " occurrences" << endl

Good luck to you!

like image 39
AlexanderBrevig Avatar answered Nov 15 '22 18:11

AlexanderBrevig