Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

From transaction data to list of sets in an efficient way

I have a csv file with transaction data of the following form

import pandas as pd
df = pd.DataFrame({'OrderID':[1,1,1,1,2,2], 'ItemID':[1,2,3,4,1,2]})
print(df)
   ItemID  OrderID
0       1        1
1       2        1
2       3        1
3       4        1
4       1        2
5       2        2

I want to obtain a list that contains for every OrderID the sets of items.

This can be obtained with

df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
[{1, 2, 3, 4}, {1, 2}]

However, on a csv file with 9 million rows this takes some time. Thus, I'm wondering if there is a faster way? I'm interested in any solution with pandas or that operates directly on a .csv-file


First of all I want to thank you guys, for your awesome input! I took a sample of 50000 OrderIds (and the corresponding items) from my real data and applied several of the methods from to the data set. And here are the results

BenchmarkResults

Note that I used the updated version of the pir programm. So the winner is divakar, even if we only consider the list of sets output.

On my whole data set, his faster set approach has a duration of 5.05 seconds and his faster list based approach a duration of only 2.32s. That is a huge gain from the initial 115 seconds! Thanks again!

like image 895
Quickbeam2k1 Avatar asked Dec 18 '22 11:12

Quickbeam2k1


1 Answers

new method
defaultdict

from collections import defaultdict

def pir(df):
    d = defaultdict(set)
    for n, g in df.groupby('OrderID').ItemID:
        d[n].update(g.values.tolist())

    return list(d.values())

sample

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 1000, 10000000),
                       ItemID=np.random.randint(0, 1000, 10000000)))

enter image description here


old method

uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)

def gu(i):
    return set(ui[ii[io == i]].tolist())

[gu(i) for i in range(len(uo))]

[{1, 2, 3, 4}, {1, 2}]

old timing
code:

def pir(df):
    uo, io = np.unique(df.OrderID.values, return_inverse=True)
    ui, ii = np.unique(df.ItemID.values, return_inverse=True)

    def gu(i):
        return set(ui[ii[io == i]].tolist())

    return [gu(i) for i in range(len(uo))]

def jez(df):
    arr = df.groupby('OrderID')['ItemID'].unique().values
    return [set(v) for v in arr]

def div(df):
    a = df.values
    sidx = a[:,1].argsort(kind='mergesort')
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))

def quik(df):
    return df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()

with sample data
enter image description here

with more data

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000),
                       ItemID=np.random.randint(0, 10, 10000)))

enter image description here

even more data

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000000),
                       ItemID=np.random.randint(0, 10, 10000000)))

enter image description here

like image 179
piRSquared Avatar answered Mar 24 '23 16:03

piRSquared