Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find most common string in a 2D list

I have a 2D list:

arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]

I want to find the most frequent element in the 2D list. In the above example, the most common string is 'Mohit'.

I know I can use brute force using two for loops and a dictionary to do this, but is there a more efficient way using numpy or any other library?

The nested lists could be of different lengths

Can someone also add the time of their methods? To find the fasted method. Also the caveats at which it might not be very efficient.

Edit

These are the timings of different methods on my system:

#timegb
%%timeit
collections.Counter(chain.from_iterable(arr)).most_common(1)[0][0]
5.91 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Kevin Fang and Curious Mind
%%timeit
flat_list = [item for sublist in arr for item in sublist]
collections.Counter(flat_list).most_common(1)[0]
6.42 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
c = collections.Counter(item for sublist in arr for item in sublist).most_common(1)c[0][0]
6.79 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Mayank Porwal
def most_common(lst):
    return max(set(lst), key=lst.count)
%%timeit
ls = list(chain.from_iterable(arr))
most_common(ls)
2.33 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#U9-Forward
%%timeit
l=[x for i in arr for x in i]
max(l,key=l.count)
2.6 µs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Mayank Porwal's method runs the fastest on my system.

like image 256
Mohit Motwani Avatar asked Nov 23 '18 05:11

Mohit Motwani


3 Answers

  1. Flatten the list with itertools.chain.from_iterable
  2. Apply a Counter.

Demo:

>>> from itertools import chain
>>> from collections import Counter
>>> 
>>> lst = [['Mohit', 'shini','Manoj','Mot'],
...:      ['Mohit', 'shini','Manoj'],
...:      ['Mohit', 'Vis', 'Nusrath']]
...:      
>>> Counter(chain.from_iterable(lst)).most_common(1)[0][0]
'Mohit'

Details:

>>> list(chain.from_iterable(lst))
['Mohit',
 'shini',
 'Manoj',
 'Mot',
 'Mohit',
 'shini',
 'Manoj',
 'Mohit',
 'Vis',
 'Nusrath']
>>> Counter(chain.from_iterable(lst))
Counter({'Manoj': 2, 'Mohit': 3, 'Mot': 1, 'Nusrath': 1, 'Vis': 1, 'shini': 2})
>>> Counter(chain.from_iterable(lst)).most_common(1)
[('Mohit', 3)]

Some timings:

>>> lst = lst*100
>>> %timeit Counter(chain.from_iterable(lst)).most_common(1)[0][0] # timgeb
53.7 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit max([x for i in lst for x in i], key=l.count) # U9-Forward
207 µs ± 389 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit Counter([x for sublist in lst for x in sublist]).most_common(1)[0][0] # Curious_Mind/Kevin Fang #1
75.2 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit Counter(item for sublist in lst for item in sublist).most_common(1)[0][0] # Kevin Fang #2
95.2 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit flat = list(chain.from_iterable(lst)); max(set(flat), key=flat.count) # Mayank Porwal
98.4 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(Note that Kevin Fang's second solution is a bit slower than the first one, but more memory efficient.)

like image 174
timgeb Avatar answered Oct 23 '22 20:10

timgeb


I'd suggest flatten out the 2D Array and then use a counter to find out the most frequent element.

flat_list = [item for sublist in arr for item in sublist]
from collections import Counter
Counter(flat_list).most_common(1)[0]
# ('Mohit', 3)
Counter(flat_list).most_common(1)[0][0]
# 'Mohit'

Not sure if it is the fastest approach though.

Edit:

@timgeb's answer has a faster way to flatten the list using itertools.chain

A more space efficient way suggested by @schwobaseggl:

from collections import Counter
c = Counter(item for sublist in arr for item in sublist).most_common(1)
# [('Mohit', 3)]
c[0][0]
# 'Mohit'
like image 4
Kevin Fang Avatar answered Oct 23 '22 22:10

Kevin Fang


One way to do it this way,

import collections
import time
start_time = time.time()
arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]

c = collections.Counter([x for sublist in arr for x in sublist])
print(c.most_common(1) )
print("--- %s seconds ---" % (time.time() - start_time)) 

Time taken: 0.00016713142395 seconds

DEMO: http://tpcg.io/NH3zjm

like image 2
Always Sunny Avatar answered Oct 23 '22 21:10

Always Sunny