Survey data many periods: transformation to current and previous period (wide to long format)

Question

I have a data frame (survey data) called df that looks like this (this is sample data):

respondent_id	r1age	r2age	r3age	r4age	r1smoke	r2smoke	r3smoke	r4smoke	r1income	r2income	r3income	r4income
16178	35	38	41	44	1	1	1	1	60	62	68	70
161719	65	68	71	74	0	0	0	1	50	52	54	56
161720	47	50	53	56	0	1	0	1	80	82	85	87

The number after the "r" or "h" represents the wave or period of each interview. For this particular example, there are only four interviews for each respondent, and data for 3 different variables (age, whether the respondent smokes, and his/her gross annual income in $10,000).

I'm interested in transforming this to get the following instead:

respondent_id	t_1_period	t_age	t_1_age	t_smoke	t_1_smoke	t_income	t_1_income
16178	1	38	35	1	1	62	60
16178	2	41	38	1	1	68	62
16178	3	44	41	1	1	70	68
161719	1	68	65	0	0	52	50
161719	2	71	68	0	0	54	52
161719	3	74	71	1	0	56	54
161720	1	50	47	1	0	82	80
161720	2	53	50	0	1	85	82
161720	3	56	53	1	0	87	85

I'm interested in repeating the respondents such that the number of observations for each respondent are the number of interviews/waves - 1 (that is, the unique transitions), and for each variable there must be t (current period) and t_1 (previous period) columns, again, for each transition. Additionally, I add a t_1_period column representing the number of the previous period for that observation.

I have tried the following:

df = pd.melt(df, id_vars=["respondent_id"])

variable_names = ["age", "smoke", "income"]

new_rows = []
for respondent_id in df["respondent_id"].unique():
  df_temp = df[df["respondent_id"] == respondent_id]

  for i in range(2, 5):
    new_row = {"respondent_id": respondent_id, "t_1_period": i-1}
    for var in variable_names:
      if var not in ["income"]:
        current_var = f"r{i}{var}"
        previous_var = f"r{i-1}{var}"
        new_row[f"t_{var}"] = df_temp[df_temp["variable"] == current_var]["value"].values[0]
        new_row[f"t_1_{var}"] = df_temp[df_temp["variable"] == previous_var]["value"].values[0]
      elif var == "income":
        current_var = f"h{i}{var}"
        previous_var = f"h{i-1}{var}"
        new_row[f"t_h{var}"] = df_temp[df_temp["variable"] == current_var]["value"].values[0]
        new_row[f"t_1_h{var}"] = df_temp[df_temp["variable"] == previous_var]["value"].values[0]

    new_rows.append(new_row)

df_periods = pd.DataFrame(new_rows)

In my real data, I have much more than 3 variables: I sometimes have up to 100. Additionally, all variables are always present for all periods, however some of them can have NaNs, but the columns are there. In terms of respondents, I can also have a lot: as much as 50,000 for example. Note that some variables start with "h" instead of "r", and others with "s" (not present in this example).

My question: is there a faster way of transforming this? Every time I want to transform the data in this t vs. t-1 version for all variables I decide to include in variable_names I have to wait a lot. I believe there must be a better way of doing this. I appreciate your help, thank you.

mozway · Accepted Answer

There are many ways to approach that, wide_to_long is an option but you would need to pre-process the column names (it expects the stubnames as prefixes, not suffixes).

I'd suggest to use a MultiIndex and stack, here is an example that doesn't require to know the stubnames:

# set aside respondent_id and create a MultiIndex
tmp = df.set_index('respondent_id')
tmp.columns = pd.MultiIndex.from_frame(tmp.columns.str.extract(r'[rh](\d+)(\D+)'),
                                       names=['t_1_period', None])

# reshape
tmp = tmp.stack(0, future_stack=True)

# concatenate the long format with a shifted version of itself
out = (pd.concat([tmp.groupby(level=0).shift(-1), tmp], keys=['t', 't_1'], axis=1)
         .sort_index(axis=1, level=1, sort_remaining=False)
      )

# flatten MultiIndex
out.columns = out.columns.map('_'.join)
out.reset_index(inplace=True)

# remove the last value per group
out = out[out['respondent_id'].duplicated(keep='last')].convert_dtypes()

Output:

    respondent_id t_1_period  t_age  t_1_age  t_income  t_1_income  t_smoke  t_1_smoke
0           16178          1     38       35        62          60        1          1
1           16178          2     41       38        68          62        1          1
2           16178          3     44       41        70          68        1          1
4          161719          1     68       65        52          50        0          0
5          161719          2     71       68        54          52        0          0
6          161719          3     74       71        56          54        1          0
8          161720          1     50       47        82          80        1          0
9          161720          2     53       50        85          82        0          1
10         161720          3     56       53        87          85        1          0

Serge de Gosson de Varennes · Answer

One way is to avoid loops if you have many features:

import pandas as pd

data = {
    "respondent_id": [16178, 161719, 161720],
    "r1age": [35, 65, 47], "r2age": [38, 68, 50], "r3age": [41, 71, 53], "r4age": [44, 74, 56],
    "r1smoke": [1, 0, 0], "r2smoke": [1, 0, 1], "r3smoke": [1, 0, 0], "r4smoke": [1, 1, 1],
    "r1income": [60, 50, 80], "r2income": [62, 52, 82], "r3income": [68, 54, 85], "r4income": [70, 56, 87]
}

df = pd.DataFrame(data)

df_long = pd.melt(df, id_vars=["respondent_id"], var_name="variable", value_name="value")
print("Melted DataFrame:
", df_long.head())

df_long['period'] = df_long['variable'].str.extract(r'(\d+)').astype(int)
df_long['variable'] = df_long['variable'].str.extract(r'([a-zA-Z]+[a-zA-Z]+)')
print("DataFrame after extracting period and variable:
", df_long.head())

df_wide = df_long.pivot_table(index=['respondent_id', 'period'], columns='variable', values='value').reset_index()
print("Pivoted DataFrame:
", df_wide.head())

df_t = df_wide.copy()
df_t_minus_1 = df_wide.copy()
df_t_minus_1['period'] += 1
print("DataFrame t:
", df_t.head())
print("DataFrame t-1:
", df_t_minus_1.head())

df_merged = pd.merge(df_t, df_t_minus_1, on=['respondent_id', 'period'], suffixes=('', '_t_1'))
print("Merged DataFrame:
", df_merged.head())

df_merged = df_merged.rename(columns={
    'period': 't_1_period',
    'age': 't_age', 'age_t_1': 't_1_age',
    'smoke': 't_smoke', 'smoke_t_1': 't_1_smoke',
    'income': 't_income', 'income_t_1': 't_1_income'
})
print("Columns after renaming:
", df_merged.head())

df_final = df_merged[df_merged['t_1_period'] > 1].copy()
df_final['t_1_period'] -= 1
print("Filtered DataFrame:
", df_final.head())

df_final = df_final[['respondent_id', 't_1_period', 't_age', 't_1_age', 't_smoke', 't_1_smoke', 't_income', 't_1_income']]


print(df_final)

With gives

Melted DataFrame:
    respondent_id variable  value
0          16178    r1age     35
1         161719    r1age     65
2         161720    r1age     47
3          16178    r2age     38
4         161719    r2age     68
DataFrame after extracting period and variable:
    respondent_id variable  value  period
0          16178      age     35       1
1         161719      age     65       1
2         161720      age     47       1
3          16178      age     38       2
4         161719      age     68       2
Pivoted DataFrame:
 variable  respondent_id  period   age  income  smoke
0                 16178       1  35.0    60.0    1.0
1                 16178       2  38.0    62.0    1.0
2                 16178       3  41.0    68.0    1.0
3                 16178       4  44.0    70.0    1.0
4                161719       1  65.0    50.0    0.0
DataFrame t:
 variable  respondent_id  period   age  income  smoke
0                 16178       1  35.0    60.0    1.0
1                 16178       2  38.0    62.0    1.0
...
5             56.0        54.0  
6             82.0        80.0  
7             85.0        82.0  
8             87.0        85.0

where the last dataframe is the df you expected.

Survey data many periods: transformation to current and previous period (wide to long format)

Tags:

performance

python

optimization

pandas

dataframe

caproki

2 Answers

mozway

Serge de Gosson de Varennes

Recent Activity

Donate For Us

Survey data many periods: transformation to current and previous period (wide to long format)

Tags:

performance

python

optimization

pandas

dataframe

caproki

2 Answers

mozway

Serge de Gosson de Varennes

Related questions

Recent Activity

Donate For Us