I have a csv file with transaction data of the following form
import pandas as pd
df = pd.DataFrame({'OrderID':[1,1,1,1,2,2], 'ItemID':[1,2,3,4,1,2]})
print(df)
   ItemID  OrderID
0       1        1
1       2        1
2       3        1
3       4        1
4       1        2
5       2        2
I want to obtain a list that contains, for every OrderID, the set of its items.
This can be obtained with
df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
[{1, 2, 3, 4}, {1, 2}]
However, on a csv file with 9 million rows this takes quite some time. So I'm wondering: is there a faster way? I'm interested in any solution, either with pandas or one that operates directly on the .csv file.
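To make concrete what I mean by operating directly on the .csv file, here is a minimal sketch that streams the rows with the standard csv module and never builds a DataFrame. The file name transactions.csv and the header names are assumptions, not my actual setup:

import csv
from collections import defaultdict

def sets_from_csv(path):
    # Stream the file once, collecting ItemIDs per order.
    # Assumes a header row with OrderID and ItemID columns.
    groups = defaultdict(set)              # OrderID -> set of ItemIDs
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            groups[row['OrderID']].add(int(row['ItemID']))
    return list(groups.values())

item_sets = sets_from_csv('transactions.csv')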
First of all I want to thank you guys for your awesome input! I took a sample of 50,000 OrderIDs (and the corresponding items) from my real data and applied several of the methods from the answers to the data set. And here are the results.
Note that I used the updated version of the pir program. So the winner is Divakar, even if we only consider the list-of-sets output.
On my whole data set, his faster set-based approach runs in 5.05 seconds and his faster list-based approach in only 2.32 seconds. That is a huge gain over the initial 115 seconds! Thanks again!
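The faster list-based variant is not reproduced in this post; as a hedged guess, it presumably looks like the set-based div function further down with the final set conversion skipped, e.g.:

import numpy as np

def div_lists(df):
    # Hypothetical list-based variant of the NumPy split approach:
    # identical grouping, but each group stays a plain list instead of a set.
    a = df.values                               # columns: ItemID, OrderID
    sidx = a[:, 1].argsort(kind='mergesort')    # stable sort by OrderID
    cut_idx = np.nonzero(a[sidx[1:], 1] > a[sidx[:-1], 1])[0] + 1
    out = np.split(a[sidx, 0], cut_idx)         # ItemIDs per order
    return [o.tolist() for o in out]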
new method: defaultdict
from collections import defaultdict

def pir(df):
    # Collect the items of each order in a set, keyed by OrderID.
    d = defaultdict(set)
    for n, g in df.groupby('OrderID').ItemID:
        d[n].update(g.values.tolist())
    return list(d.values())
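Applied to the small frame from the question, this gives the expected output:

df = pd.DataFrame({'OrderID': [1, 1, 1, 1, 2, 2],
                   'ItemID':  [1, 2, 3, 4, 1, 2]})
print(pir(df))  # [{1, 2, 3, 4}, {1, 2}]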
sample
import numpy as np

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 1000, 10000000),
                       ItemID=np.random.randint(0, 1000, 10000000)))
old method
uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)

def gu(i):
    return set(ui[ii[io == i]].tolist())

[gu(i) for i in range(len(uo))]

[{1, 2, 3, 4}, {1, 2}]
old timing
code:
def pir(df):
    # np.unique-based grouping: map OrderIDs and ItemIDs to integer codes,
    # then collect the items of each order with a boolean mask.
    uo, io = np.unique(df.OrderID.values, return_inverse=True)
    ui, ii = np.unique(df.ItemID.values, return_inverse=True)

    def gu(i):
        return set(ui[ii[io == i]].tolist())

    return [gu(i) for i in range(len(uo))]

def jez(df):
    # groupby + unique per group, then convert each array to a set.
    arr = df.groupby('OrderID')['ItemID'].unique().values
    return [set(v) for v in arr]

def div(df):
    # Stable sort by OrderID (column 1), find the group boundaries,
    # and split the ItemID column (column 0) at those boundaries.
    a = df.values
    sidx = a[:, 1].argsort(kind='mergesort')
    cut_idx = np.nonzero(a[sidx[1:], 1] > a[sidx[:-1], 1])[0] + 1
    out = np.split(a[sidx, 0], cut_idx)
    return list(map(set, out))

def quik(df):
    # The original groupby/apply approach from the question.
    return df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
with sample data
with more data
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000),
                       ItemID=np.random.randint(0, 10, 10000)))
even more data
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000000),
                       ItemID=np.random.randint(0, 10, 10000000)))
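To rerun the comparison on any of these frames, a minimal timing harness over the four functions defined above could look like this:

import timeit

# Average of three runs per function; df is whichever sample frame
# was built above.
for fn in (pir, jez, div, quik):
    t = timeit.timeit(lambda: fn(df), number=3) / 3
    print('{}: {:.3f}s per run'.format(fn.__name__, t))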