Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python with Itertools?

I have a .csv the contains 3 columns. PersonX, PersonY, and PersonZ. There are 7000 rows of names and different combinations. My goal is to see what pair and group of 3 are the highest match. I have not been able to find a formula in excel that would achieve this goal. I'm sure python would be able to with itertools combination but I'm not that advanced yet. The names can be in any order, just looking to see how many times those 2 or 3 people are in the same row. Any suggestions would be a great help, thank you!

Small Example of Data.

PersonX         PersonY             PersonZ
Aaron Ekblad    Keith Yandle        Vincent Trocheck
Aaron Ekblad    Denis Malgin        Mike Matheson
Aaron Ekblad    Denis Malgin        Mike Matheson
Aaron Ekblad    Jonathan Huberdeau  Keith Yandle
Aaron Ekblad    Jonathan Huberdeau  Keith Yandle
Aaron Ekblad    Jamie McGinn        Keith Yandle
Aaron Ekblad    Aleksander Barkov   Jonathan Huberdeau
Aaron Ekblad        
Adam Erne       Andrej Sustr        Vladislav Namestnikov
Adam Erne       Anthony Cirelli 
Adam Erne       
Adam Henrique   Rickard Rakell      Ryan Getzlaf
Adam Henrique   Brandon Montour     Ryan Getzlaf
Adam Henrique   Corey Perry         Brandon Montour
Adam Henrique   Corey Perry         Brandon Montour
Adam Henrique   Brian Gibbons       Andy Greene
Adam Henrique   Ryan Getzlaf    
Adam Henrique   Ondrej Kase 
Adam Henrique   Josh Manson 
Adam Henrique   Brian Gibbons   
Adam Henrique       
Adam Henrique   

Starting Script

import csv
from itertools import combinations, product

#Header = PersonX PersonY PersonZ

#Import Game
with open('1718_All_Goals_&_Assists.csv', newline='') as f:
    next(f)
    skaters = '\n'.join(' '.join(row) for row in csv.reader(f))
    print(skaters)
like image 525
Michael T Johnson Avatar asked Apr 12 '18 21:04

Michael T Johnson


People also ask

Is Itertools faster than for loops Python?

That being said, the iterators from itertools are often significantly faster than regular iteration from a standard Python for loop.

Is Itertools in Python standard library?

Itertools is a Python module that is part of the Python 3 standard libraries. It lets us perform memory and computation efficient tasks on iterators. It is inspired by constructs from APL, Haskell, and SML.

What is chain () in Python?

chain() function It is a function that takes a series of iterables and returns one iterable. It groups all the iterables together and produces a single iterable as output. Its output cannot be used directly and thus explicitly converted into iterables.


1 Answers

You could simply use collections.Counter on your csv.reader:

from collections import Counter

>>> cnt = Counter(frozenset(item.strip() for item in line if item.strip()) for line in csv.reader(f))

You can get the sorted (descending) output by using the .most_common method of Counter:

>>> cnt.most_common()
[(frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}), 2),
 (frozenset({'Aaron Ekblad', 'Jonathan Huberdeau', 'Keith Yandle'}), 2),
 (frozenset({'Adam Henrique', 'Brandon Montour', 'Corey Perry'}), 2),
 (frozenset({'Adam Henrique'}), 2),
 (frozenset({'Aaron Ekblad', 'Keith Yandle', 'Vincent Trocheck'}), 1),
 (frozenset({'Aaron Ekblad', 'Jamie McGinn', 'Keith Yandle'}), 1),
 (frozenset({'Aaron Ekblad', 'Aleksander Barkov', 'Jonathan Huberdeau'}), 1),
 (frozenset({'Aaron Ekblad'}), 1),
 (frozenset({'Adam Erne', 'Andrej Sustr', 'Vladislav Namestnikov'}), 1),
 (frozenset({'Adam Erne', 'Anthony Cirelli'}), 1),
 (frozenset({'Adam Erne'}), 1),
 (frozenset({'Adam Henrique', 'Rickard Rakell', 'Ryan Getzlaf'}), 1),
 (frozenset({'Adam Henrique', 'Brandon Montour', 'Ryan Getzlaf'}), 1),
 (frozenset({'Adam Henrique', 'Andy Greene', 'Brian Gibbons'}), 1),
 (frozenset({'Adam Henrique', 'Ryan Getzlaf'}), 1),
 (frozenset({'Adam Henrique', 'Ondrej Kase'}), 1),
 (frozenset({'Adam Henrique', 'Josh Manson'}), 1),
 (frozenset({'Adam Henrique', 'Brian Gibbons'}), 1)]

Similarly you can get the most common (in case of ties it will return just one of those) using this:

>>> cnt.most_common(1)
[(frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}), 2)]

If you want all that have the maximum count you could use a custom approach using max:

>>> maximum_occurences = max(cnt.values())
>>> [group for group, occurences in cnt.items() if occurences == maximum_occurences]
[frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}),
 frozenset({'Aaron Ekblad', 'Jonathan Huberdeau', 'Keith Yandle'}),
 frozenset({'Adam Henrique', 'Brandon Montour', 'Corey Perry'}),
 frozenset({'Adam Henrique'})]

To add a bit more explanation:

Counter just counts elements, which should be exactly what you want - but the values to count have to be hashable. That's inconvenient because csv.reader will return lists (inconvenient because lists are unhashable), so you need another data-structure.

tuples will come to mind, however because you said that the names could be in any order you probably want an unordered collection. That means you should use frozenset. There is one caveat though: It cannot hold the same value multiple times, so if you have people with the same name you cannot use that.

I don't know how your csv is structured, so it's likely that the result may contain empty "columns" or whitespace before/after the names, that's why I filtered empty elements and stripped the names with:

item.strip() for item in line if item.strip()

before passing it to the frozenset.

In case you don't like that it strips twice you could add another comprehension or map:

frozenset(item for item in (item.strip() for item in line) if item)
frozenset(item for item in map(lambda x: x.strip(), line) if item)
frozenset(item for item in map(str.strip, line) if item)  # if all items are really of type str

By the way the outer comprehension could then be replaced by filter by bool:

frozenset(filter(bool, map(str.strip, line)))

Oh, the wonders of "functional programming". And I haven't even used itertools.

I totally forgot: In case you want to omit the single-person-groups you could easily filter the Counter before you access the most_common groups:

cnt = Counter({k: v for k, v in cnt.items() if len(k) > 1})

That can be applied afterwards or you can try to fiddle that into the comprehension inside the Counter, but I'll leave that as an exercise for the interested reader.


Just in case you could have the same name multiple times in one line and you don't want to lose that "information" you could count the names in each line and then convert that Counter to a tuple that is going to be counted.

Counter(tuple(Counter(filter(bool, map(str.strip, line))).most_common()) for line in csv.reader(f))

I hope I haven't overdone it with the one-liners.

like image 94
MSeifert Avatar answered Oct 06 '22 22:10

MSeifert