So I have a dataframe that contains some wrong information that I want to fix:
import pandas as pd
tuples_index = [(1,1990), (2,1999), (2,2002), (3,1992), (3,1994), (3,1996)]
index = pd.MultiIndex.from_tuples(tuples_index, names=['id', 'FirstYear'])
df = pd.DataFrame([2007, 2006, 2006, 2000, 2000, 2000], index=index, columns=['LastYear'] )
df
Out[4]:
LastYear
id FirstYear
1 1990 2007
2 1999 2006
2002 2006
3 1992 2000
1994 2000
1996 2000
id refers to a business, and this DataFrame is a small example slice of a much larger one that shows how a business moves. Each record is a unique location, and I want to capture the first and last year it was there. The current 'LastYear' is accurate for businesses with only one record, and accurate for the latest record of businesses for more than one record. What the df should look like at the end is this:
LastYear
id FirstYear
1 1990 2007
2 1999 2002
2002 2006
3 1992 1994
1994 1996
1996 2000
And what I did to get it there was super clunky:
multirecord = df.groupby(level=0).filter(lambda x: len(x) > 1)
multirecord_grouped = multirecord.groupby(level=0)
ls = []
for _, group in multirecord_grouped:
levels = group.index.get_level_values(level=1).tolist() + [group['LastYear'].iloc[-1]]
ls += levels[1:]
multirecord['LastYear'] = pd.Series(ls, index=multirecord.index.copy())
final_joined = pd.concat([df.groupby(level=0).filter(lambda x: len(x) == 1),multirecord]).sort_index()
Is there a better way?
shift_year = lambda df: df.index.get_level_values('FirstYear').to_series().shift(-1)
df.groupby(level=0).apply(shift_year) \
.combine_first(df.LastYear).astype(int) \
.rename('LastYear').to_frame()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With