
Is there a significantly better way to find the most common word in a list (Python only)

Considering a trivial implementation of the problem, I am looking for a significantly faster way to find the most common word in a Python list. As part of a Python interview I received feedback that this implementation is so inefficient that it is basically a failure. Later, I tried many algorithms I found, and only some heapsearch-based solutions are a bit faster, but not overwhelmingly so (when scaled to tens of millions of items, heapsearch is about 30% faster; on trivial lengths like a thousand it is almost the same; measured with timeit).

def stupid(words):
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return max(freqs, key=freqs.get)
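
For reference, a heap-based variant of the kind I mean might look roughly like this (a sketch, not the exact code I timed; it still has to build the full frequency dict first):

import heapq

def heap_variant(words):
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    # Keep a size-1 heap over the (word, count) pairs; still linear overall.
    return heapq.nlargest(1, freqs.items(), key=lambda kv: kv[1])[0][0]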

As this is a simple problem and I have some experience (although I am nowhere near an algorithms guru or a competitive coder), I was surprised.

Of course, I would like to improve my skills and learn that much better way of solving the problem, so your input will be appreciated.

Clarification regarding the duplicate status: my point is to find out whether there actually is a much (asymptotically) better solution; other similar questions have accepted answers that are not much better. If this is not enough to make the question unique, then by all means close it.

Update

Thank you all for the input. Regarding the interview situation, I am left with the impression that a hand-written search algorithm was expected (one that may be somewhat more efficient), and/or that the reviewer was assessing the code from the point of view of another language with different constant factors. Of course, everyone is entitled to their own standards.

What was important for me was to find out whether I am totally clueless (I had the impression that I am not) or whether I just usually write code that is not the best possible. It is still possible that an even better algorithm exists, but if it has remained hidden from the community here for a few days, I am fine with that.

I am picking the most upvoted answer - it seems fair to do so, even though more than one person provided useful feedback.

Minor update

It seems that using defaultdict has a noticeable advantage over using the get method, even when get is aliased to a local name.
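
For concreteness, the defaultdict variant I am referring to looks roughly like this (a sketch):

from collections import defaultdict

def with_defaultdict(words):
    freqs = defaultdict(int)
    for w in words:
        freqs[w] += 1    # no per-word method call such as freqs.get
    return max(freqs, key=freqs.get)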

asked Jul 08 '15 by Petar Donchev


4 Answers

That sounds like a bad interview question, probably a case of the interviewer expecting a certain answer. It definitely sounds like s/he didn't clearly explain what s/he was asking.

Your solution is O(n) (where n = len(words)), and using a heap doesn't change that.

There are faster approximate solutions...
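
For example, counting only a random sample (a rough sketch; sample_size is an arbitrary choice and the answer is only probably correct):

import random
from collections import Counter

def approx_most_common(words, sample_size=10000):
    # Count a random sample instead of the whole list; fast, but approximate.
    sample = random.sample(words, min(sample_size, len(words)))
    return Counter(sample).most_common(1)[0][0]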

answered by taleinat


from collections import Counter

word_counter = Counter(words)

word_counter is a dictionary with words as keys and frequencies as values, and also has a most_common() method.
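
For example, to get just the top word and its count (a minimal usage sketch):

# most_common(1) returns a list like [(word, count)]
top_word, top_count = word_counter.most_common(1)[0]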

answered by DeepSpace


Function calls and global-namespace lookups are comparatively expensive in CPython.

Your stupid function makes a function call (freqs.get) for every element of the word list, and then another one per distinct word inside the max call. The second set is entirely avoidable: iterating over the keys of a dict and then looking up each value with dict.get is a flagrant inefficiency when you can iterate over key-value pairs instead.

def stupid(words):
  freqs = {}
  for w in words:
    freqs[w] = freqs.get(w, 0) + 1
  return max(freqs, key=freqs.get)

def most_frequent(words):
  ## Build the frequency dict
  freqs = {}
  for w in words:
    if w in freqs:
      freqs[w] += 1
    else:
      freqs[w] = 1
  ## Search the frequency dict
  m_k = None
  m_v = 0
  for k, v in freqs.iteritems():
    if v > m_v:
      m_k, m_v = k, v
  return m_k, m_v

Using user1952500's single-pass suggestion, how does this fare on your large sample sets?

def faster(words):
  freq = {}
  m_k = None
  m_v = 0
  for w in words:
    if w in freq:
      v = freq[w] + 1
    else:
      v = 1
    freq[w] = v
    if v > m_v:
      m_k = w
      m_v = v
  return m_k, m_v

This has the slight advantage of being stable when there are multiple most-frequent values: the first word to reach the maximum count wins, as shown below.
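
For example, on a small input with a tie (a quick illustration, not part of the benchmark below):

>>> faster(['a', 'b', 'a', 'b'])
('a', 2)

Here 'a' reaches the maximum count first, whereas max(freqs, key=freqs.get) breaks such ties by dict iteration order.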


Comparison of all suggestions, using NLTK's sample texts (nltk.book) to produce a words sample:
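
(The answer doesn't state exactly which texts were combined; a words list of roughly this shape could be built along these lines, purely as a sketch:)

# Requires the NLTK data: nltk.download('book')
from nltk.book import text1, text2, text3
words = list(text1) + list(text2) + list(text3)  # which texts were concatenated is a guess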

def word_frequency_version1(words):
  """Petar's initial"""
  freqs = {}
  for w in words:
    freqs[w] = freqs.get(w, 0) + 1
  return max(freqs, key=freqs.get)

def word_frequency_version2(words):
  """Matt's initial"""
  ## Build the frequency dict
  freqs = {}
  for w in words:
    if w in freqs:
      freqs[w] += 1
    else:
      freqs[w] = 1
  ## Search the frequency dict
  m_k = None
  m_v = 0
  for k, v in freqs.iteritems():
    if v > m_v:
      m_k, m_v = k, v
  return m_k, m_v

def word_frequency_version3(words):
  """Noting max as we go"""
  freq = {}
  m_k = None
  m_v = 0
  for w in words:
    if w in freq:
      v = freq[w] + 1
    else:
      v = 1
    freq[w] = v
    if v > m_v:
      m_k = w
      m_v = v
  return m_k, m_v

from collections import Counter
def word_frequency_version4(words):
  """Built-in Counter"""
  c = Counter(words)
  return c.most_common()[0]


from multiprocessing import Pool
def chunked(seq, count):
  ## Split seq into `count` contiguous chunks; the last chunk absorbs any
  ## remainder so no trailing elements are dropped.
  v = len(seq) // count
  for i in range(count):
    yield seq[i*v:] if i == count - 1 else seq[i*v:(i+1)*v]

def frequency_map(words):
  freq = {}
  for w in words:
    if w in freq:
      freq[w] += 1
    else:
      freq[w] = 1
  return freq

def frequency_reduce(results):
  freq = {}
  for result in results:
    for k, v in result.iteritems():
      if k in freq:
        freq[k] += v
      else:
        freq[k] = v
  m_k = None
  m_v = 0
  for k, v in freq.iteritems():
      if v > m_v:
        m_k = k
        m_v = v
  return m_k, m_v

# def word_frequency_version5(words,chunks=5,pool_size=5):
#   pool = Pool(processes=pool_size)
#   result = frequency_reduce(pool.map(frequency_map,chunked(words,chunks)))
#   pool.close()
#   return result

def word_frequency_version5(words,chunks=5,pool=Pool(processes=5)):
  """multiprocessing Matt's initial suggestion"""
  return frequency_reduce(pool.map(frequency_map,chunked(words,chunks)))

def word_frequency_version6(words):
  """Petar's one-liner"""
  return max(set(words),key=words.count)


import timeit
freq1 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version1 as func; print func.__doc__')
freq2 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version2 as func; print func.__doc__')
freq3 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version3 as func; print func.__doc__')
freq4 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version4 as func; print func.__doc__')
freq5 = timeit.Timer('func(words,chunks=chunks)','from __main__ import words, word_frequency_version5 as func; print func.__doc__; chunks=10')
freq6 = timeit.Timer('func(words)','from __main__ import words, word_frequency_version6 as func; print func.__doc__')

Results:

>>> print "n={n}, m={m}".format(n=len(words),m=len(set(words)))
n=692766, m=34464
>>> freq1.timeit(10)
"Petar's initial"
3.914874792098999
>>> freq2.timeit(10)
"Matt's initial"
3.8329160213470459
>>> freq3.timeit(10)
"Noting max as we go"
4.1247420310974121
>>> freq4.timeit(10)
"Built-in Counter"
6.1084718704223633
>>> freq5.timeit(10)
"multiprocessing Matt's initial suggestion"
9.7867341041564941

Notes:

  • I'm cheating by passing a multiprocessing.Pool instance as a kwarg for timing purposes: I wanted to avoid the pool start-up cost, and timeit doesn't let you specify clean-up code. This was run on a "quad"-CPU VM; I'm sure that for some input sizes and CPU counts multiprocessing will be faster.
  • For the most part the versions return the highest-frequency word; which word you get may be arbitrary if there's a tie for first place.
  • Approximations of the highest-frequency word (e.g. by sampling) may be faster, but will only be approximate.
  • Version 6 (the one-liner) should be ignored for large values of n*m, since words.count is O(n) per distinct word, making it O(n*m) overall.
answered by MattH


You have to go through all the words at least once, which gives a lower bound of Omega(n). Keeping a running count for each distinct word costs Omega(log n) per operation with comparison-based storage.

If you had a store with O(1) get/set for the distinct words, you could build an O(n) solution. To my knowledge, comparison-based stores of any kind (heap, map, tree, set, ...) only give Omega(log n) per operation; see the edits below for hash tables.

EDIT (see the comments): your solution is O(n log n) because of the dictionary operations, plus O(n) for the max(), making it O(n log n) in total... which is fine.

To my knowledge this is, complexity-wise, a good solution. You might get better performance using different kinds of storage, like syntax trees or a heap, but the complexity should stay the same.

EDIT: As per the comment discussion, with a hash table you get O(n) on average (amortized): n average-O(1) dict updates plus a scan over the m distinct keys for the max() gives O(n + m) = O(n), since m <= n.

answered by Darwly