i have the following table
date ui mw maxw tC HL msurp
01/03/2004 A 10 10 eC 0.25 0.1
01/04/2004 A 10 10 eC 0.25 -0.1
01/03/2004 B 20 20 bC 0.5 0.3
01/03/2004 B 20 20 bC 0.25 0.3
what i am looking to do is add a column to this table that basically enumerates the unique combinations of ui, mw, maxw, tC and HL and enumerates
so for example in the above table
unique combinations of ui, mw, maxw, tC and HL are
A,10, 10, eC, 0.25
B,20, 20, bC, 0.5
B,20, 20, bC, 0.5
There are total 3 so the output should be something like
date ui mw maxw tC HL msurp counter
01/03/2004 A 10 10 eC 0.25 0.1 1
01/04/2004 A 10 10 eC 0.25 -0.1 1
01/03/2004 B 20 20 bC 0.5 0.3 2
01/03/2004 B 20 20 bC 0.25 0.3 3
By doing groupby() pandas returns you a dict of grouped DFs. You can easily get the key list of this dict by python built in function keys() . Save this answer.
You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.
Pandas groupby is used for grouping the data according to the categories and apply a function to the categories. It also helps to aggregate data efficiently. Pandas dataframe. groupby() function is used to split the data into groups based on some criteria.
Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Option 1pd.Series.factorize
df.assign(
counter=df[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, 1).factorize()[0] + 1)
date ui mw maxw tC HL msurp counter
0 01/03/2004 A 10 10 eC 0.25 0.1 1
1 01/04/2004 A 10 10 eC 0.25 -0.1 1
2 01/03/2004 B 20 20 bC 0.50 0.3 2
3 01/03/2004 B 20 20 bC 0.25 0.3 3
Option 1.5
More obnoxious version of option 1 but should be faster
df.assign(
counter=pd.factorize(list(zip(
*[df[c].values.tolist() for c in ['ui', 'mw', 'maxw', 'tC', 'HL']]
)))[0] + 1
)
date ui mw maxw tC HL msurp counter
0 01/03/2004 A 10 10 eC 0.25 0.1 1
1 01/04/2004 A 10 10 eC 0.25 -0.1 1
2 01/03/2004 B 20 20 bC 0.50 0.3 2
3 01/03/2004 B 20 20 bC 0.25 0.3 3
Option 2
@ayhan's answer (will delete if he posts it)
df.assign(
counter=df.groupby(['ui', 'mw', 'maxw', 'tC', 'HL']).ngroup() + 1)
date ui mw maxw tC HL msurp counter
0 01/03/2004 A 10 10 eC 0.25 0.1 1
1 01/04/2004 A 10 10 eC 0.25 -0.1 1
2 01/03/2004 B 20 20 bC 0.50 0.3 3
3 01/03/2004 B 20 20 bC 0.25 0.3 2
Timing
Code Below
(lambda r: r.div(r.min(1), 0).assign(best=lambda x: x.idxmin(1)))(results)
pir1 pir2 ayhan best
100 17.260639 1.000000 3.438354 pir2
300 30.550010 1.000000 2.598456 pir2
1000 43.201163 1.000000 1.236190 pir2
3000 61.593932 1.000000 1.025420 pir2
10000 127.003138 2.177171 1.000000 ayhan
pir1 = lambda d: d.assign(counter=d[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, 1).factorize()[0] + 1)
pir2 = lambda d: d.assign(counter=pd.factorize(list(zip(*[d[c].values.tolist() for c in ['ui', 'mw', 'maxw', 'tC', 'HL']])))[0] + 1)
ayhan = lambda d: d.assign(counter=d.groupby(['ui', 'mw', 'maxw', 'tC', 'HL']).ngroup() + 1)
results = pd.DataFrame(
index=[100, 300, 1000, 3000, 10000],
columns='pir1 pir2 ayhan'.split(),
dtype=float
)
for i in results.index:
d = pd.concat([df] * i, ignore_index=True)
for j in results.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
results.set_value(i, j, timeit(stmt, setp, number=10))
results.plot(loglog=True)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With