I have a data frame (survey data) called df that looks like this (this is sample data):
| respondent_id | r1age | r2age | r3age | r4age | r1smoke | r2smoke | r3smoke | r4smoke | r1income | r2income | r3income | r4income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16178 | 35 | 38 | 41 | 44 | 1 | 1 | 1 | 1 | 60 | 62 | 68 | 70 |
| 161719 | 65 | 68 | 71 | 74 | 0 | 0 | 0 | 1 | 50 | 52 | 54 | 56 |
| 161720 | 47 | 50 | 53 | 56 | 0 | 1 | 0 | 1 | 80 | 82 | 85 | 87 |
The number after the "r" or "h" represents the wave or period of each interview. For this particular example, there are only four interviews for each respondent, and data for 3 different variables (age, whether the respondent smokes, and his/her gross annual income in $10,000).
I'm interested in transforming this to get the following instead:
| respondent_id | t_1_period | t_age | t_1_age | t_smoke | t_1_smoke | t_income | t_1_income |
|---|---|---|---|---|---|---|---|
| 16178 | 1 | 38 | 35 | 1 | 1 | 62 | 60 |
| 16178 | 2 | 41 | 38 | 1 | 1 | 68 | 62 |
| 16178 | 3 | 44 | 41 | 1 | 1 | 70 | 68 |
| 161719 | 1 | 68 | 65 | 0 | 0 | 52 | 50 |
| 161719 | 2 | 71 | 68 | 0 | 0 | 54 | 52 |
| 161719 | 3 | 74 | 71 | 1 | 0 | 56 | 54 |
| 161720 | 1 | 50 | 47 | 1 | 0 | 82 | 80 |
| 161720 | 2 | 53 | 50 | 0 | 1 | 85 | 82 |
| 161720 | 3 | 56 | 53 | 1 | 0 | 87 | 85 |
I'm interested in repeating the respondents such that the number of observations for each respondent are the number of interviews/waves - 1 (that is, the unique transitions), and for each variable there must be t (current period) and t_1 (previous period) columns, again, for each transition. Additionally, I add a t_1_period column representing the number of the previous period for that observation.
I have tried the following:
df = pd.melt(df, id_vars=["respondent_id"])
variable_names = ["age", "smoke", "income"]
new_rows = []
for respondent_id in df["respondent_id"].unique():
df_temp = df[df["respondent_id"] == respondent_id]
for i in range(2, 5):
new_row = {"respondent_id": respondent_id, "t_1_period": i-1}
for var in variable_names:
if var not in ["income"]:
current_var = f"r{i}{var}"
previous_var = f"r{i-1}{var}"
new_row[f"t_{var}"] = df_temp[df_temp["variable"] == current_var]["value"].values[0]
new_row[f"t_1_{var}"] = df_temp[df_temp["variable"] == previous_var]["value"].values[0]
elif var == "income":
current_var = f"h{i}{var}"
previous_var = f"h{i-1}{var}"
new_row[f"t_h{var}"] = df_temp[df_temp["variable"] == current_var]["value"].values[0]
new_row[f"t_1_h{var}"] = df_temp[df_temp["variable"] == previous_var]["value"].values[0]
new_rows.append(new_row)
df_periods = pd.DataFrame(new_rows)
In my real data, I have much more than 3 variables: I sometimes have up to 100. Additionally, all variables are always present for all periods, however some of them can have NaNs, but the columns are there. In terms of respondents, I can also have a lot: as much as 50,000 for example. Note that some variables start with "h" instead of "r", and others with "s" (not present in this example).
My question: is there a faster way of transforming this? Every time I want to transform the data in this t vs. t-1 version for all variables I decide to include in variable_names I have to wait a lot. I believe there must be a better way of doing this. I appreciate your help, thank you.
There are many ways to approach that, wide_to_long is an option but you would need to pre-process the column names (it expects the stubnames as prefixes, not suffixes).
I'd suggest to use a MultiIndex and stack, here is an example that doesn't require to know the stubnames:
# set aside respondent_id and create a MultiIndex
tmp = df.set_index('respondent_id')
tmp.columns = pd.MultiIndex.from_frame(tmp.columns.str.extract(r'[rh](\d+)(\D+)'),
names=['t_1_period', None])
# reshape
tmp = tmp.stack(0, future_stack=True)
# concatenate the long format with a shifted version of itself
out = (pd.concat([tmp.groupby(level=0).shift(-1), tmp], keys=['t', 't_1'], axis=1)
.sort_index(axis=1, level=1, sort_remaining=False)
)
# flatten MultiIndex
out.columns = out.columns.map('_'.join)
out.reset_index(inplace=True)
# remove the last value per group
out = out[out['respondent_id'].duplicated(keep='last')].convert_dtypes()
Output:
respondent_id t_1_period t_age t_1_age t_income t_1_income t_smoke t_1_smoke
0 16178 1 38 35 62 60 1 1
1 16178 2 41 38 68 62 1 1
2 16178 3 44 41 70 68 1 1
4 161719 1 68 65 52 50 0 0
5 161719 2 71 68 54 52 0 0
6 161719 3 74 71 56 54 1 0
8 161720 1 50 47 82 80 1 0
9 161720 2 53 50 85 82 0 1
10 161720 3 56 53 87 85 1 0
One way is to avoid loops if you have many features:
import pandas as pd
data = {
"respondent_id": [16178, 161719, 161720],
"r1age": [35, 65, 47], "r2age": [38, 68, 50], "r3age": [41, 71, 53], "r4age": [44, 74, 56],
"r1smoke": [1, 0, 0], "r2smoke": [1, 0, 1], "r3smoke": [1, 0, 0], "r4smoke": [1, 1, 1],
"r1income": [60, 50, 80], "r2income": [62, 52, 82], "r3income": [68, 54, 85], "r4income": [70, 56, 87]
}
df = pd.DataFrame(data)
df_long = pd.melt(df, id_vars=["respondent_id"], var_name="variable", value_name="value")
print("Melted DataFrame:\n", df_long.head())
df_long['period'] = df_long['variable'].str.extract(r'(\d+)').astype(int)
df_long['variable'] = df_long['variable'].str.extract(r'([a-zA-Z]+[a-zA-Z]+)')
print("DataFrame after extracting period and variable:\n", df_long.head())
df_wide = df_long.pivot_table(index=['respondent_id', 'period'], columns='variable', values='value').reset_index()
print("Pivoted DataFrame:\n", df_wide.head())
df_t = df_wide.copy()
df_t_minus_1 = df_wide.copy()
df_t_minus_1['period'] += 1
print("DataFrame t:\n", df_t.head())
print("DataFrame t-1:\n", df_t_minus_1.head())
df_merged = pd.merge(df_t, df_t_minus_1, on=['respondent_id', 'period'], suffixes=('', '_t_1'))
print("Merged DataFrame:\n", df_merged.head())
df_merged = df_merged.rename(columns={
'period': 't_1_period',
'age': 't_age', 'age_t_1': 't_1_age',
'smoke': 't_smoke', 'smoke_t_1': 't_1_smoke',
'income': 't_income', 'income_t_1': 't_1_income'
})
print("Columns after renaming:\n", df_merged.head())
df_final = df_merged[df_merged['t_1_period'] > 1].copy()
df_final['t_1_period'] -= 1
print("Filtered DataFrame:\n", df_final.head())
df_final = df_final[['respondent_id', 't_1_period', 't_age', 't_1_age', 't_smoke', 't_1_smoke', 't_income', 't_1_income']]
print(df_final)
With gives
Melted DataFrame:
respondent_id variable value
0 16178 r1age 35
1 161719 r1age 65
2 161720 r1age 47
3 16178 r2age 38
4 161719 r2age 68
DataFrame after extracting period and variable:
respondent_id variable value period
0 16178 age 35 1
1 161719 age 65 1
2 161720 age 47 1
3 16178 age 38 2
4 161719 age 68 2
Pivoted DataFrame:
variable respondent_id period age income smoke
0 16178 1 35.0 60.0 1.0
1 16178 2 38.0 62.0 1.0
2 16178 3 41.0 68.0 1.0
3 16178 4 44.0 70.0 1.0
4 161719 1 65.0 50.0 0.0
DataFrame t:
variable respondent_id period age income smoke
0 16178 1 35.0 60.0 1.0
1 16178 2 38.0 62.0 1.0
...
5 56.0 54.0
6 82.0 80.0
7 85.0 82.0
8 87.0 85.0
where the last dataframe is the df you expected.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With