Python with Itertools?

Tags:

I have a .csv the contains 3 columns. PersonX, PersonY, and PersonZ. There are 7000 rows of names and different combinations. My goal is to see what pair and group of 3 are the highest match. I have not been able to find a formula in excel that would achieve this goal. I'm sure python would be able to with itertools combination but I'm not that advanced yet. The names can be in any order, just looking to see how many times those 2 or 3 people are in the same row. Any suggestions would be a great help, thank you!

Small Example of Data.

PersonX         PersonY             PersonZ
Aaron Ekblad    Keith Yandle        Vincent Trocheck
Aaron Ekblad    Denis Malgin        Mike Matheson
Aaron Ekblad    Denis Malgin        Mike Matheson
Aaron Ekblad    Jonathan Huberdeau  Keith Yandle
Aaron Ekblad    Jonathan Huberdeau  Keith Yandle
Aaron Ekblad    Jamie McGinn        Keith Yandle
Aaron Ekblad    Aleksander Barkov   Jonathan Huberdeau
Aaron Ekblad        
Adam Erne       Andrej Sustr        Vladislav Namestnikov
Adam Erne       Anthony Cirelli 
Adam Erne       
Adam Henrique   Rickard Rakell      Ryan Getzlaf
Adam Henrique   Brandon Montour     Ryan Getzlaf
Adam Henrique   Corey Perry         Brandon Montour
Adam Henrique   Corey Perry         Brandon Montour
Adam Henrique   Brian Gibbons       Andy Greene
Adam Henrique   Ryan Getzlaf    
Adam Henrique   Ondrej Kase 
Adam Henrique   Josh Manson 
Adam Henrique   Brian Gibbons   
Adam Henrique       
Adam Henrique

Starting Script

import csv
from itertools import combinations, product

#Header = PersonX PersonY PersonZ

#Import Game
with open('1718_All_Goals_&_Assists.csv', newline='') as f:
    next(f)
    skaters = '\n'.join(' '.join(row) for row in csv.reader(f))
    print(skaters)

525

asked Apr 12 '18 21:04

Michael T Johnson

1 Answers

You could simply use collections.Counter on your csv.reader:

from collections import Counter

>>> cnt = Counter(frozenset(item.strip() for item in line if item.strip()) for line in csv.reader(f))

You can get the sorted (descending) output by using the .most_common method of Counter:

>>> cnt.most_common()
[(frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}), 2),
 (frozenset({'Aaron Ekblad', 'Jonathan Huberdeau', 'Keith Yandle'}), 2),
 (frozenset({'Adam Henrique', 'Brandon Montour', 'Corey Perry'}), 2),
 (frozenset({'Adam Henrique'}), 2),
 (frozenset({'Aaron Ekblad', 'Keith Yandle', 'Vincent Trocheck'}), 1),
 (frozenset({'Aaron Ekblad', 'Jamie McGinn', 'Keith Yandle'}), 1),
 (frozenset({'Aaron Ekblad', 'Aleksander Barkov', 'Jonathan Huberdeau'}), 1),
 (frozenset({'Aaron Ekblad'}), 1),
 (frozenset({'Adam Erne', 'Andrej Sustr', 'Vladislav Namestnikov'}), 1),
 (frozenset({'Adam Erne', 'Anthony Cirelli'}), 1),
 (frozenset({'Adam Erne'}), 1),
 (frozenset({'Adam Henrique', 'Rickard Rakell', 'Ryan Getzlaf'}), 1),
 (frozenset({'Adam Henrique', 'Brandon Montour', 'Ryan Getzlaf'}), 1),
 (frozenset({'Adam Henrique', 'Andy Greene', 'Brian Gibbons'}), 1),
 (frozenset({'Adam Henrique', 'Ryan Getzlaf'}), 1),
 (frozenset({'Adam Henrique', 'Ondrej Kase'}), 1),
 (frozenset({'Adam Henrique', 'Josh Manson'}), 1),
 (frozenset({'Adam Henrique', 'Brian Gibbons'}), 1)]

Similarly you can get the most common (in case of ties it will return just one of those) using this:

>>> cnt.most_common(1)
[(frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}), 2)]

If you want all that have the maximum count you could use a custom approach using max:

>>> maximum_occurences = max(cnt.values())
>>> [group for group, occurences in cnt.items() if occurences == maximum_occurences]
[frozenset({'Aaron Ekblad', 'Denis Malgin', 'Mike Matheson'}),
 frozenset({'Aaron Ekblad', 'Jonathan Huberdeau', 'Keith Yandle'}),
 frozenset({'Adam Henrique', 'Brandon Montour', 'Corey Perry'}),
 frozenset({'Adam Henrique'})]

To add a bit more explanation:

Counter just counts elements, which should be exactly what you want - but the values to count have to be hashable. That's inconvenient because csv.reader will return lists (inconvenient because lists are unhashable), so you need another data-structure.

tuples will come to mind, however because you said that the names could be in any order you probably want an unordered collection. That means you should use frozenset. There is one caveat though: It cannot hold the same value multiple times, so if you have people with the same name you cannot use that.

I don't know how your csv is structured, so it's likely that the result may contain empty "columns" or whitespace before/after the names, that's why I filtered empty elements and stripped the names with:

item.strip() for item in line if item.strip()

before passing it to the frozenset.

In case you don't like that it strips twice you could add another comprehension or map:

frozenset(item for item in (item.strip() for item in line) if item)
frozenset(item for item in map(lambda x: x.strip(), line) if item)
frozenset(item for item in map(str.strip, line) if item)  # if all items are really of type str

By the way the outer comprehension could then be replaced by filter by bool:

frozenset(filter(bool, map(str.strip, line)))

Oh, the wonders of "functional programming". And I haven't even used itertools.

I totally forgot: In case you want to omit the single-person-groups you could easily filter the Counter before you access the most_common groups:

cnt = Counter({k: v for k, v in cnt.items() if len(k) > 1})

That can be applied afterwards or you can try to fiddle that into the comprehension inside the Counter, but I'll leave that as an exercise for the interested reader.

Just in case you could have the same name multiple times in one line and you don't want to lose that "information" you could count the names in each line and then convert that Counter to a tuple that is going to be counted.

Counter(tuple(Counter(filter(bool, map(str.strip, line))).most_common()) for line in csv.reader(f))

I hope I haven't overdone it with the one-liners.

answered Oct 06 '22 22:10

MSeifert

Related questions
                            
                                multiple autofilters for the same EPPlus worksheet
                            
                                OpenXml SpreadsheetDocument.Open(...) get exception - The main part is missing
                            
                                Use custom colors with Spreadsheet gem
                            
                                RExcel Error: There seems to be no R process connected to Excel
                            
                                Reach ActiveX or Form objects (textbox) from codebehind in an Excel document
                            
                                C# reading Excel cell values using Microsoft.Office.Interop.Excel
                            
                                Passing a Character Array from VBA to Fortran DLL through a Type is corrupting the other Type members
                            
                                Is it possible to pass a reference to a pointer from Excel VBA to C++?
                            
                                Unregister a XLL in Excel (VBA)
                            
                                Excel trailing comma bug in csv files
                            
                                Supported Html elements by Excel
                            
                                OpenXML SDK Inject VBA into excel workbook
                            
                                Save embedded Word Doc as PDF
                            
                                Finding the last row of an Excel spreadsheet when the last row is hidden
                            
                                Dynamically appending an array in VBA
                            
                                On Error Goto 0 not resetting error trapping
                            
                                File is corrupted after creating excel (.xlsx) file by using Apache POI with Java
                            
                                How to access a Word public variable in Excel VBA
                            
                                How can I parse an Excel file using SheetJS from an external link (Amazon S3)
                            
                                Iterate through columns in Read-only workbook in openpyxl

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python with Itertools?

Tags:

python-3.x

excel

itertools

Michael T Johnson

People also ask

1 Answers

To add a bit more explanation:

MSeifert

Recent Activity

Donate For Us