
Group and combine items of multiple-column lists with itertools/more-itertools in Python

This code:

from itertools import groupby, count 
L = [38, 98, 110, 111, 112, 120, 121, 898] 
groups = groupby(L, key=lambda item, c=count():item-next(c))
tmp = [list(g) for k, g in groups]

Takes [38, 98, 110, 111, 112, 120, 121, 898], groups consecutive numbers together, and merges each group into this final output:

['38', '98', '110,112', '120,121', '898']
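The snippet above stops at tmp; the merge step itself is not shown. A minimal sketch of one way to get the final strings, assuming each run is reduced to its first and last value (the names counter, runs and merged are illustrative, not from the original code):

```python
from itertools import count, groupby

L = [38, 98, 110, 111, 112, 120, 121, 898]

# Value minus position is constant inside a run of consecutive numbers,
# so it can serve as the groupby key.
counter = count()
runs = [list(g) for _, g in groupby(L, key=lambda item: item - next(counter))]

# Collapse each run to "first,last", or to the value itself for a run of one.
merged = [str(r[0]) if len(r) == 1 else "{},{}".format(r[0], r[-1]) for r in runs]
print(merged)
```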

How can the same be done with a list of lists with multiple columns, such as the list below, grouping the rows by name and by consecutive values in the second column, and then merging each group?

In other words, this data:

L= [
['Italy','1','3'],
['Italy','2','1'],
['Spain','4','2'],
['Spain','5','8'],
['Italy','3','10'],
['Spain','6','4'],
['France','5','3'],
['Spain','20','2']]

should give the following output:

[['Italy','1-2-3','3-1-10'],
['France','5','3'],
['Spain','4-5-6','2-8-4'],
['Spain','20','2']]

Would more-itertools be more appropriate for this task?

mistervela asked Feb 07 '18 12:02


3 Answers

You can build on the same recipe and modify the lambda function to include the first item (the country) from each row as well. Before grouping, you need to sort the list so that rows for the same country are adjacent; the code below orders the countries by their last occurrence in the list.

from itertools import groupby, count


L = [
    ['Italy', '1', '3'],
    ['Italy', '2', '1'],
    ['Spain', '4', '2'],
    ['Spain', '5', '8'],
    ['Italy', '3', '10'],
    ['Spain', '6', '4'],
    ['France', '5', '3'],
    ['Spain', '20', '2']]


indices = {row[0]: i for i, row in enumerate(L)}
sorted_l = sorted(L, key=lambda row: indices[row[0]])
groups = groupby(
    sorted_l,
    lambda item, c=count(): [item[0], int(item[1]) - next(c)]
)
for k, g in groups:
    print([k[0]] + ['-'.join(x) for x in zip(*(x[1:] for x in g))])

Output:

['Italy', '1-2-3', '3-1-10']
['France', '5', '3']
['Spain', '4-5-6', '2-8-4']
['Spain', '20', '2']
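To see why this key works, it can help to print the key computed for each row of the sorted list. A small sketch, assuming the rows are already in the order produced by the sort above (rows and keys are illustrative names):

```python
from itertools import count

# The rows in the order produced by sorting on last occurrence of the country.
rows = [
    ['Italy', '1', '3'],
    ['Italy', '2', '1'],
    ['Italy', '3', '10'],
    ['France', '5', '3'],
    ['Spain', '4', '2'],
    ['Spain', '5', '8'],
    ['Spain', '6', '4'],
    ['Spain', '20', '2'],
]

# Same key as in the answer: country plus (second column - running position).
c = count()
keys = [(row[0], int(row[1]) - next(c)) for row in rows]

for row, key in zip(rows, keys):
    print(row, '->', key)
```

Rows that share a country and form a consecutive run end up with identical keys, which is exactly what groupby needs to put them in one group.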
Ashwini Chaudhary answered Oct 17 '22 00:10



This is essentially the same grouping technique, but rather than using itertools.count it uses enumerate to produce the indices.

First, we sort the data so that all rows for a given country are adjacent and ordered by their remaining columns (compared as strings). Then we use groupby to make a group for each country, and a second groupby in the inner loop to group together the consecutive rows within each country. Finally, we use zip and .join to rearrange the data into the desired output format.

from itertools import groupby
from operator import itemgetter

lst = [
    ['Italy','1','3'],
    ['Italy','2','1'],
    ['Spain','4','2'],
    ['Spain','5','8'],
    ['Italy','3','10'],
    ['Spain','6','4'],
    ['France','5','3'],
    ['Spain','20','2'],
]

newlst = [[country] + ['-'.join(s) for s in zip(*[v[1][1:] for v in g])]
    for country, u in groupby(sorted(lst), itemgetter(0))
        for _, g in groupby(enumerate(u), lambda t: int(t[1][1]) - t[0])]

for row in newlst:
    print(row)

Output:

['France', '5', '3']
['Italy', '1-2-3', '3-1-10']
['Spain', '20', '2']
['Spain', '4-5-6', '2-8-4']

I admit that lambda is a bit cryptic; it would probably be better to use a proper def function instead.


Here's the same thing using a much more readable key function.

def keyfunc(t):
    # Unpack the index and data
    i, data = t
    # Get the 2nd column from the data, as an integer
    val = int(data[1])
    # The difference between val & i is constant in a consecutive group
    return val - i

newlst = [[country] + ['-'.join(s) for s in zip(*[v[1][1:] for v in g])]
    for country, u in groupby(sorted(lst), itemgetter(0))
        for _, g in groupby(enumerate(u), keyfunc)]
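One caveat: sorted(lst) in the code above compares the number columns as strings, so '10' and '11' would sort before '9' and a run such as 9, 10, 11 would be split. A possible variant, assuming the second column always holds integer strings, sorts on an explicit numeric key instead (group_consecutive and sample are hypothetical names used only for this sketch):

```python
from itertools import groupby
from operator import itemgetter

def group_consecutive(rows):
    # Sort by country, then by the second column as an integer.
    ordered = sorted(rows, key=lambda r: (r[0], int(r[1])))
    result = []
    for country, u in groupby(ordered, itemgetter(0)):
        # Within one country, value minus index is constant for a consecutive run.
        for _, g in groupby(enumerate(u), lambda t: int(t[1][1]) - t[0]):
            cols = zip(*(row[1:] for _, row in g))
            result.append([country] + ['-'.join(col) for col in cols])
    return result

sample = [['Spain', '9', 'a'], ['Spain', '10', 'b'], ['Spain', '11', 'c']]
print(group_consecutive(sample))
```

With the numeric key the groups for each country also come out in ascending order of the second column.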
PM 2Ring answered Oct 17 '22 01:10



Instead of using itertools.groupby, which requires sorting the data before grouping, here is an approach that builds the groups in a single pass over the list using dictionaries:

d = {}
for country, i, j in L:
    item = int(i)
    # One inner dict per country, mapping a group number to its rows
    groups = d.setdefault(country, {})
    for recs in groups.values():
        last = int(recs[-1][0])
        # Extend a group when the value equals or is adjacent to its last entry
        if item in {last - 1, last, last + 1}:
            recs.append([i, j])
            recs.sort(key=lambda x: int(x[0]))
            break
    else:
        # No existing group matched: start a new one, numbered from 1
        groups[len(groups) + 1] = [[i, j]]

Demo on a more complex example:

L = [['Italy', '1', '3'],
 ['Italy', '2', '1'],
 ['Spain', '4', '2'],
 ['Spain', '5', '8'],
 ['Italy', '3', '10'],
 ['Spain', '6', '4'],
 ['France', '5', '3'],
 ['Spain', '20', '2'],
 ['France', '5', '44'],
 ['France', '9', '3'],
 ['Italy', '3', '10'],
 ['Italy', '5', '17'],
 ['Italy', '4', '13'],]

{'France': {1: [['5', '3'], ['5', '44']], 2: [['9', '3']]},
 'Spain': {1: [['4', '2'], ['5', '8'], ['6', '4']], 2: [['20', '2']]},
 'Italy': {1: [['1', '3'], ['2', '1'], ['3', '10'], ['3', '10'], ['4', '13']], 2: [['5', '17']]}}

# You can then produce the results in your intended format as below:
for country, recs in d.items():
    for rec in recs.values():
        i, j = zip(*rec)
        print([country, '-'.join(i), '-'.join(j)])

['France', '5-5', '3-44']
['France', '9', '3']
['Italy', '1-2-3-3-4', '3-1-10-10-13']
['Italy', '5', '17']
['Spain', '4-5-6', '2-8-4']
['Spain', '20', '2']
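For reuse, the grouping and formatting steps above can be wrapped in one function and run on the data from the question. This is only a sketch: it assumes Python 3.7+ (so dicts keep insertion order), group_rows is a hypothetical name, and a plain list of groups stands in for the numbered inner dict:

```python
def group_rows(rows):
    d = {}
    for country, i, j in rows:
        item = int(i)
        groups = d.setdefault(country, [])
        for recs in groups:
            last = int(recs[-1][0])
            # Same rule as above: extend a group if the value equals
            # or is adjacent to the group's last entry.
            if item in {last - 1, last, last + 1}:
                recs.append([i, j])
                recs.sort(key=lambda x: int(x[0]))
                break
        else:
            # No group matched, so start a new one for this country.
            groups.append([[i, j]])

    out = []
    for country, groups in d.items():
        for recs in groups:
            i_col, j_col = zip(*recs)
            out.append([country, '-'.join(i_col), '-'.join(j_col)])
    return out

L = [
    ['Italy', '1', '3'], ['Italy', '2', '1'], ['Spain', '4', '2'],
    ['Spain', '5', '8'], ['Italy', '3', '10'], ['Spain', '6', '4'],
    ['France', '5', '3'], ['Spain', '20', '2'],
]
for row in group_rows(L):
    print(row)
```

Countries appear in the order they are first seen in the input, so the row order differs from the groupby-based answers even though the groups are the same.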
Mazdak answered Oct 17 '22 01:10
