I have a large data set i need to conduct time-series analysis on. The data is currently in excel formated as below:
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| ID | Q1 09 variable X | Q2 09 variable X | Q3 09 variable X | Q4 09 variable X | Q1 10 variable X | Q2 10 variable X | Q3 10 variable X | Q4 10 variable X | Q1 09 variable Y | Q2 09 variable Y | Q3 09 variable Y | Q4 09 variable Y | Q1 10 variable Y | Q2 10 variable Y | Q3 10 variable Y | Q4 10 variable Y |
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| 1 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 2 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 3 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 4 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 5 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 6 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 7 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 8 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
| 9 | X | X | X | X | X | X | X | X | Y | Y | Y | Y | Y | Y | Y | Y |
+----+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
The above table is an example. The dataset contains data on all quarters up to Q1 2019. I need the data to look like this:
+----+-------+------------+------------+
| ID | T | Variable X | Variable Y |
+----+-------+------------+------------+
| 1 | Q1 09 | X | Y |
| 1 | Q2 09 | X | Y |
| 1 | Q3 09 | X | Y |
| 1 | Q4 09 | X | Y |
| 1 | Q1 10 | X | Y |
| 1 | Q2 10 | X | Y |
| 1 | Q3 10 | X | Y |
| 1 | Q4 10 | X | Y |
| 2 | Q1 09 | X | Y |
| 2 | Q2 09 | X | Y |
| 2 | Q3 09 | X | Y |
| 2 | Q4 09 | X | Y |
| 2 | Q1 10 | X | Y |
| 2 | Q2 10 | X | Y |
| 2 | Q3 10 | X | Y |
| 2 | Q4 10 | X | Y |
| 3 | Q1 09 | X | Y |
| 3 | Q2 09 | X | Y |
| 3 | Q3 09 | X | Y |
| 3 | Q4 09 | X | Y |
| 3 | Q1 10 | X | Y |
| 3 | Q2 10 | X | Y |
| 3 | Q3 10 | X | Y |
| 3 | Q4 10 | X | Y |
| 4 | Q1 09 | X | Y |
| 4 | Q2 09 | X | Y |
| 4 | Q3 09 | X | Y |
| 4 | Q4 09 | X | Y |
| 4 | Q1 10 | X | Y |
| 4 | Q2 10 | X | Y |
| 4 | Q3 10 | X | Y |
| 4 | Q4 10 | X | Y |
| 5 | Q1 09 | X | Y |
| 5 | Q2 09 | X | Y |
| 5 | Q3 09 | X | Y |
| 5 | Q4 09 | X | Y |
| 5 | Q1 10 | X | Y |
| 5 | Q2 10 | X | Y |
| 5 | Q3 10 | X | Y |
| 5 | Q4 10 | X | Y |
| 6 | Q1 09 | X | Y |
| 6 | Q2 09 | X | Y |
| 6 | Q3 09 | X | Y |
| 6 | Q4 09 | X | Y |
| 6 | Q1 10 | X | Y |
| 6 | Q2 10 | X | Y |
| 6 | Q3 10 | X | Y |
| 6 | Q4 10 | X | Y |
| 7 | Q1 09 | X | Y |
| 7 | Q2 09 | X | Y |
| 7 | Q3 09 | X | Y |
| 7 | Q4 09 | X | Y |
| 7 | Q1 10 | X | Y |
| 7 | Q2 10 | X | Y |
| 7 | Q3 10 | X | Y |
| 7 | Q4 10 | X | Y |
| 8 | Q1 09 | X | Y |
| 8 | Q2 09 | X | Y |
| 8 | Q3 09 | X | Y |
| 8 | Q4 09 | X | Y |
| 8 | Q1 10 | X | Y |
| 8 | Q2 10 | X | Y |
| 8 | Q3 10 | X | Y |
| 8 | Q4 10 | X | Y |
| 9 | Q1 09 | X | Y |
| 9 | Q2 09 | X | Y |
| 9 | Q3 09 | X | Y |
| 9 | Q4 09 | X | Y |
| 9 | Q1 10 | X | Y |
| 9 | Q2 10 | X | Y |
| 9 | Q3 10 | X | Y |
| 9 | Q4 10 | X | Y |
+----+-------+------------+------------+
The data set is very large, several thousands data points in total.
I previosuly posted regarding this issue, but I guess I did not formulate the question properly. i i tried the code below, but it left out the Y column.
df.columns = [
df.columns.to_series().groupby(level=0).cumcount().map({0: 'X', 1: 'Y'}),
df.columns
]
df.stack().rename_axis(['ID', 'T']).reset_index()
We can split your dataframe by variables (X, Y) with df.filter
. Then we melt your columns to rows with .melt
. Finally we concat the seperated data together again with pd.concat
.
The last line is to clean up your T
column by removing everything with format:
variable X
X = df[['ID']].join(df.filter(regex='X$')).melt(id_vars='ID', var_name='T', value_name='Variable X')
Y = df[['ID']].join(df.filter(regex='Y$')).melt(id_vars='ID', var_name='T', value_name='Variable Y')
df = pd.concat([X, Y[['Variable Y']]], axis=1).sort_values(['ID', 'T']).reset_index(drop=True)
df['T'] = df['T'].str.replace('\svariable\s[A-Za-z]', '')
Output
ID T Variable X Variable Y
0 1 Q1 09 X Y
1 1 Q1 10 X Y
2 1 Q2 09 X Y
3 1 Q2 10 X Y
4 1 Q3 09 X Y
5 1 Q3 10 X Y
6 1 Q4 09 X Y
7 1 Q4 10 X Y
8 2 Q1 09 X Y
9 2 Q1 10 X Y
10 2 Q2 09 X Y
11 2 Q2 10 X Y
12 2 Q3 09 X Y
13 2 Q3 10 X Y
14 2 Q4 09 X Y
15 2 Q4 10 X Y
16 3 Q1 09 X Y
17 3 Q1 10 X Y
18 3 Q2 09 X Y
19 3 Q2 10 X Y
20 3 Q3 09 X Y
21 3 Q3 10 X Y
22 3 Q4 09 X Y
23 3 Q4 10 X Y
24 4 Q1 09 X Y
25 4 Q1 10 X Y
26 4 Q2 09 X Y
27 4 Q2 10 X Y
28 4 Q3 09 X Y
29 4 Q3 10 X Y
.. .. ... ... ...
42 6 Q2 09 X Y
43 6 Q2 10 X Y
44 6 Q3 09 X Y
45 6 Q3 10 X Y
46 6 Q4 09 X Y
47 6 Q4 10 X Y
48 7 Q1 09 X Y
49 7 Q1 10 X Y
50 7 Q2 09 X Y
51 7 Q2 10 X Y
52 7 Q3 09 X Y
53 7 Q3 10 X Y
54 7 Q4 09 X Y
55 7 Q4 10 X Y
56 8 Q1 09 X Y
57 8 Q1 10 X Y
58 8 Q2 09 X Y
59 8 Q2 10 X Y
60 8 Q3 09 X Y
61 8 Q3 10 X Y
62 8 Q4 09 X Y
63 8 Q4 10 X Y
64 9 Q1 09 X Y
65 9 Q1 10 X Y
66 9 Q2 09 X Y
67 9 Q2 10 X Y
68 9 Q3 09 X Y
69 9 Q3 10 X Y
70 9 Q4 09 X Y
71 9 Q4 10 X Y
[72 rows x 4 columns]
Create MultiIndex
by all columns with variables, so possible reshape by DataFrame.stack
- both T
columns are joined together:
df = df.set_index('ID')
df.columns = df.columns.str.split(n=2, expand=True)
df = df.stack([0,1]).rename_axis(('ID','T','T1')).reset_index()
df['T'] = df['T'] + ' ' + df.pop('T1')
print (df.head(10))
ID T variable X variable Y
0 1 Q1 09 X Y
1 1 Q1 10 X Y
2 1 Q2 09 X Y
3 1 Q2 10 X Y
4 1 Q3 09 X Y
5 1 Q3 10 X Y
6 1 Q4 09 X Y
7 1 Q4 10 X Y
8 2 Q1 09 X Y
9 2 Q1 10 X Y
Or if possible create MultiIndex
by first 5 values and another values of columns:
df = df.set_index('ID')
df.columns = pd.MultiIndex.from_tuples(zip(df.columns.str[:5], df.columns.str[5:]))
df = df.stack(0).rename_axis(('ID','T')).reset_index()
If order is important create column with ordered categorical and last sort values:
df = df.set_index('ID')
c = df.columns.str[:5]
df.columns = pd.MultiIndex.from_tuples(zip(c, df.columns.str[5:]))
df1 = df.stack(0).rename_axis(('ID','T')).reset_index()
df1['T'] = pd.CategoricalIndex(df1['T'], ordered=True, categories=c.unique())
df1 = df1.sort_values(['ID','T'])
print (df1.head(10))
ID T variable X variable Y
0 1 Q1 09 X Y
2 1 Q2 09 X Y
4 1 Q3 09 X Y
6 1 Q4 09 X Y
1 1 Q1 10 X Y
3 1 Q2 10 X Y
5 1 Q3 10 X Y
7 1 Q4 10 X Y
8 2 Q1 09 X Y
10 2 Q2 09 X Y
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With