I have a large data set i need to conduct time-series analysis on. The data is currently in excel formated as below:
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| ID | Q1 09 variable X | Q2 09 variable X | Q3 09 variable X | Q4 09 variable X | Q1 10 variable X | Q2 10 variable X | Q3 10 variable X | Q4 10 variable X | Q1 09 variable Y | Q2 09 variable Y | Q3 09 variable Y | Q4 09 variable Y | Q1 10 variable Y | Q2 10 variable Y | Q3 10 variable Y | Q4 10 variable Y |
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| 1 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 2 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 3 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 4 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 5 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 6 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 7 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 8 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 9 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
The above table is an example. The dataset contains data on all quarters up to Q1 2019. I need the data to look like this:
+----+-------+------------+------------+
| ID | T | Variable X | Variable Y |
+----+-------+------------+------------+
| 1 | Q1 09 | X | Y |
| 1 | Q2 09 | X | Y |
| 1 | Q3 09 | X | Y |
| 1 | Q4 09 | X | Y |
| 1 | Q1 10 | X | Y |
| 1 | Q2 10 | X | Y |
| 1 | Q3 10 | X | Y |
| 1 | Q4 10 | X | Y |
| 2 | Q1 09 | X | Y |
| 2 | Q2 09 | X | Y |
| 2 | Q3 09 | X | Y |
| 2 | Q4 09 | X | Y |
| 2 | Q1 10 | X | Y |
| 2 | Q2 10 | X | Y |
| 2 | Q3 10 | X | Y |
| 2 | Q4 10 | X | Y |
| 3 | Q1 09 | X | Y |
| 3 | Q2 09 | X | Y |
| 3 | Q3 09 | X | Y |
| 3 | Q4 09 | X | Y |
| 3 | Q1 10 | X | Y |
| 3 | Q2 10 | X | Y |
| 3 | Q3 10 | X | Y |
| 3 | Q4 10 | X | Y |
| 4 | Q1 09 | X | Y |
| 4 | Q2 09 | X | Y |
| 4 | Q3 09 | X | Y |
| 4 | Q4 09 | X | Y |
| 4 | Q1 10 | X | Y |
| 4 | Q2 10 | X | Y |
| 4 | Q3 10 | X | Y |
| 4 | Q4 10 | X | Y |
| 5 | Q1 09 | X | Y |
| 5 | Q2 09 | X | Y |
| 5 | Q3 09 | X | Y |
| 5 | Q4 09 | X | Y |
| 5 | Q1 10 | X | Y |
| 5 | Q2 10 | X | Y |
| 5 | Q3 10 | X | Y |
| 5 | Q4 10 | X | Y |
| 6 | Q1 09 | X | Y |
| 6 | Q2 09 | X | Y |
| 6 | Q3 09 | X | Y |
| 6 | Q4 09 | X | Y |
| 6 | Q1 10 | X | Y |
| 6 | Q2 10 | X | Y |
| 6 | Q3 10 | X | Y |
| 6 | Q4 10 | X | Y |
| 7 | Q1 09 | X | Y |
| 7 | Q2 09 | X | Y |
| 7 | Q3 09 | X | Y |
| 7 | Q4 09 | X | Y |
| 7 | Q1 10 | X | Y |
| 7 | Q2 10 | X | Y |
| 7 | Q3 10 | X | Y |
| 7 | Q4 10 | X | Y |
| 8 | Q1 09 | X | Y |
| 8 | Q2 09 | X | Y |
| 8 | Q3 09 | X | Y |
| 8 | Q4 09 | X | Y |
| 8 | Q1 10 | X | Y |
| 8 | Q2 10 | X | Y |
| 8 | Q3 10 | X | Y |
| 8 | Q4 10 | X | Y |
| 9 | Q1 09 | X | Y |
| 9 | Q2 09 | X | Y |
| 9 | Q3 09 | X | Y |
| 9 | Q4 09 | X | Y |
| 9 | Q1 10 | X | Y |
| 9 | Q2 10 | X | Y |
| 9 | Q3 10 | X | Y |
| 9 | Q4 10 | X | Y |
+----+-------+------------+------------+
The data set is very large, several thousands data points in total.
I previosuly posted regarding this issue, but I guess I did not formulate the question properly. i i tried the code below, but it left out the Y column.
df.columns = [
df.columns.to_series().groupby(level=0).cumcount().map({0: 'X', 1: 'Y'}),
df.columns
]
df.stack().rename_axis(['ID', 'T']).reset_index()
We can split your dataframe by variables (X, Y) with df.filter. Then we melt your columns to rows with .melt. Finally we concat the seperated data together again with pd.concat.
The last line is to clean up your T column by removing everything with format:
variable X
X = df[['ID']].join(df.filter(regex='X$')).melt(id_vars='ID', var_name='T', value_name='Variable X')
Y = df[['ID']].join(df.filter(regex='Y$')).melt(id_vars='ID', var_name='T', value_name='Variable Y')
df = pd.concat([X, Y[['Variable Y']]], axis=1).sort_values(['ID', 'T']).reset_index(drop=True)
df['T'] = df['T'].str.replace('\svariable\s[A-Za-z]', '')
Output
ID T Variable X Variable Y
0 1 Q1 09 X Y
1 1 Q1 10 X Y
2 1 Q2 09 X Y
3 1 Q2 10 X Y
4 1 Q3 09 X Y
5 1 Q3 10 X Y
6 1 Q4 09 X Y
7 1 Q4 10 X Y
8 2 Q1 09 X Y
9 2 Q1 10 X Y
10 2 Q2 09 X Y
11 2 Q2 10 X Y
12 2 Q3 09 X Y
13 2 Q3 10 X Y
14 2 Q4 09 X Y
15 2 Q4 10 X Y
16 3 Q1 09 X Y
17 3 Q1 10 X Y
18 3 Q2 09 X Y
19 3 Q2 10 X Y
20 3 Q3 09 X Y
21 3 Q3 10 X Y
22 3 Q4 09 X Y
23 3 Q4 10 X Y
24 4 Q1 09 X Y
25 4 Q1 10 X Y
26 4 Q2 09 X Y
27 4 Q2 10 X Y
28 4 Q3 09 X Y
29 4 Q3 10 X Y
.. .. ... ... ...
42 6 Q2 09 X Y
43 6 Q2 10 X Y
44 6 Q3 09 X Y
45 6 Q3 10 X Y
46 6 Q4 09 X Y
47 6 Q4 10 X Y
48 7 Q1 09 X Y
49 7 Q1 10 X Y
50 7 Q2 09 X Y
51 7 Q2 10 X Y
52 7 Q3 09 X Y
53 7 Q3 10 X Y
54 7 Q4 09 X Y
55 7 Q4 10 X Y
56 8 Q1 09 X Y
57 8 Q1 10 X Y
58 8 Q2 09 X Y
59 8 Q2 10 X Y
60 8 Q3 09 X Y
61 8 Q3 10 X Y
62 8 Q4 09 X Y
63 8 Q4 10 X Y
64 9 Q1 09 X Y
65 9 Q1 10 X Y
66 9 Q2 09 X Y
67 9 Q2 10 X Y
68 9 Q3 09 X Y
69 9 Q3 10 X Y
70 9 Q4 09 X Y
71 9 Q4 10 X Y
[72 rows x 4 columns]
Create MultiIndex by all columns with variables, so possible reshape by DataFrame.stack - both T columns are joined together:
df = df.set_index('ID')
df.columns = df.columns.str.split(n=2, expand=True)
df = df.stack([0,1]).rename_axis(('ID','T','T1')).reset_index()
df['T'] = df['T'] + ' ' + df.pop('T1')
print (df.head(10))
ID T variable X variable Y
0 1 Q1 09 X Y
1 1 Q1 10 X Y
2 1 Q2 09 X Y
3 1 Q2 10 X Y
4 1 Q3 09 X Y
5 1 Q3 10 X Y
6 1 Q4 09 X Y
7 1 Q4 10 X Y
8 2 Q1 09 X Y
9 2 Q1 10 X Y
Or if possible create MultiIndex by first 5 values and another values of columns:
df = df.set_index('ID')
df.columns = pd.MultiIndex.from_tuples(zip(df.columns.str[:5], df.columns.str[5:]))
df = df.stack(0).rename_axis(('ID','T')).reset_index()
If order is important create column with ordered categorical and last sort values:
df = df.set_index('ID')
c = df.columns.str[:5]
df.columns = pd.MultiIndex.from_tuples(zip(c, df.columns.str[5:]))
df1 = df.stack(0).rename_axis(('ID','T')).reset_index()
df1['T'] = pd.CategoricalIndex(df1['T'], ordered=True, categories=c.unique())
df1 = df1.sort_values(['ID','T'])
print (df1.head(10))
ID T variable X variable Y
0 1 Q1 09 X Y
2 1 Q2 09 X Y
4 1 Q3 09 X Y
6 1 Q4 09 X Y
1 1 Q1 10 X Y
3 1 Q2 10 X Y
5 1 Q3 10 X Y
7 1 Q4 10 X Y
8 2 Q1 09 X Y
10 2 Q2 09 X Y
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With