I have a .csv
the contains 3 columns. PersonX, PersonY, and PersonZ. There are 7000 rows of names and different combinations. My goal is to see what pair and group of 3 are the highest match. I have not been able to find a formula in excel that would achieve this goal. I'm sure python would be able to with itertools combination but I'm not that advanced yet. The names can be in any order, just looking to see how many times those 2 or 3 people are in the same row. Any suggestions would be a great help, thank you!
Small Example of Data.
PersonX PersonY PersonZ
Aaron Ekblad Keith Yandle Vincent Trocheck
Aaron Ekblad Denis Malgin Mike Matheson
Aaron Ekblad Denis Malgin Mike Matheson
Aaron Ekblad Jonathan Huberdeau Keith Yandle
Aaron Ekblad Jonathan Huberdeau Keith Yandle
Aaron Ekblad Jamie McGinn Keith Yandle
Aaron Ekblad Aleksander Barkov Jonathan Huberdeau
Aaron Ekblad
Adam Erne Andrej Sustr Vladislav Namestnikov
Adam Erne Anthony Cirelli
Adam Erne
Adam Henrique Rickard Rakell Ryan Getzlaf
Adam Henrique Brandon Montour Ryan Getzlaf
Adam Henrique Corey Perry Brandon Montour
Adam Henrique Corey Perry Brandon Montour
Adam Henrique Brian Gibbons Andy Greene
Adam Henrique Ryan Getzlaf
Adam Henrique Ondrej Kase
Adam Henrique Josh Manson
Adam Henrique Brian Gibbons
Adam Henrique
Adam Henrique
Starting Script
import csv
from itertools import combinations, product
#Header = PersonX PersonY PersonZ
#Import Game
with open('1718_All_Goals_&_Assists.csv', newline='') as f:
next(f)
skaters = '\n'.join(' '.join(row) for row in csv.reader(f))
print(skaters)
That being said, the iterators from itertools are often significantly faster than regular iteration from a standard Python for loop.
Itertools is a Python module that is part of the Python 3 standard libraries. It lets us perform memory and computation efficient tasks on iterators. It is inspired by constructs from APL, Haskell, and SML.
chain() function It is a function that takes a series of iterables and returns one iterable. It groups all the iterables together and produces a single iterable as output. Its output cannot be used directly and thus explicitly converted into iterables.
You could simply use collections.Counter
on your csv.reader
:
from collections import Counter
>>> cnt = Counter(frozenset(item.strip() for item in line if item.strip()) for line in csv.reader(f))
You can get the sorted (descending) output by using the .most_common
method of Counter:
>>> cnt.most_common()
[(frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}), 2),
(frozenset({'Aaron Ekblad', 'Jonathan Huberdeau', 'Keith Yandle'}), 2),
(frozenset({'Adam Henrique', 'Brandon Montour', 'Corey Perry'}), 2),
(frozenset({'Adam Henrique'}), 2),
(frozenset({'Aaron Ekblad', 'Keith Yandle', 'Vincent Trocheck'}), 1),
(frozenset({'Aaron Ekblad', 'Jamie McGinn', 'Keith Yandle'}), 1),
(frozenset({'Aaron Ekblad', 'Aleksander Barkov', 'Jonathan Huberdeau'}), 1),
(frozenset({'Aaron Ekblad'}), 1),
(frozenset({'Adam Erne', 'Andrej Sustr', 'Vladislav Namestnikov'}), 1),
(frozenset({'Adam Erne', 'Anthony Cirelli'}), 1),
(frozenset({'Adam Erne'}), 1),
(frozenset({'Adam Henrique', 'Rickard Rakell', 'Ryan Getzlaf'}), 1),
(frozenset({'Adam Henrique', 'Brandon Montour', 'Ryan Getzlaf'}), 1),
(frozenset({'Adam Henrique', 'Andy Greene', 'Brian Gibbons'}), 1),
(frozenset({'Adam Henrique', 'Ryan Getzlaf'}), 1),
(frozenset({'Adam Henrique', 'Ondrej Kase'}), 1),
(frozenset({'Adam Henrique', 'Josh Manson'}), 1),
(frozenset({'Adam Henrique', 'Brian Gibbons'}), 1)]
Similarly you can get the most common (in case of ties it will return just one of those) using this:
>>> cnt.most_common(1)
[(frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}), 2)]
If you want all that have the maximum count you could use a custom approach using max
:
>>> maximum_occurences = max(cnt.values())
>>> [group for group, occurences in cnt.items() if occurences == maximum_occurences]
[frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}),
frozenset({'Aaron Ekblad', 'Jonathan Huberdeau', 'Keith Yandle'}),
frozenset({'Adam Henrique', 'Brandon Montour', 'Corey Perry'}),
frozenset({'Adam Henrique'})]
Counter
just counts elements, which should be exactly what you want - but the values to count have to be hashable. That's inconvenient because csv.reader
will return lists (inconvenient because lists are unhashable), so you need another data-structure.
tuples
will come to mind, however because you said that the names could be in any order you probably want an unordered collection. That means you should use frozenset
. There is one caveat though: It cannot hold the same value multiple times, so if you have people with the same name you cannot use that.
I don't know how your csv is structured, so it's likely that the result may contain empty "columns" or whitespace before/after the names, that's why I filtered empty elements and stripped the names with:
item.strip() for item in line if item.strip()
before passing it to the frozenset
.
In case you don't like that it strips
twice you could add another comprehension or map
:
frozenset(item for item in (item.strip() for item in line) if item)
frozenset(item for item in map(lambda x: x.strip(), line) if item)
frozenset(item for item in map(str.strip, line) if item) # if all items are really of type str
By the way the outer comprehension could then be replaced by filter
by bool
:
frozenset(filter(bool, map(str.strip, line)))
Oh, the wonders of "functional programming". And I haven't even used itertools
.
I totally forgot: In case you want to omit the single-person-groups you could easily filter the Counter
before you access the most_common
groups:
cnt = Counter({k: v for k, v in cnt.items() if len(k) > 1})
That can be applied afterwards or you can try to fiddle that into the comprehension inside the Counter
, but I'll leave that as an exercise for the interested reader.
Just in case you could have the same name multiple times in one line and you don't want to lose that "information" you could count the names in each line and then convert that Counter to a tuple that is going to be counted.
Counter(tuple(Counter(filter(bool, map(str.strip, line))).most_common()) for line in csv.reader(f))
I hope I haven't overdone it with the one-liners.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With