I have two CSV files with the same column names but different values. The first column is the index (time), and one of the data columns is a unique identifier (id). The index (time) is different for each CSV file. I have read the data into two dataframes using read_csv, giving me the following:
+------+-----+------+-------+
|      | id  | size | price |
| time |     |      |       |
+------+-----+------+-------+
| t0   | ID1 |   10 |   110 |
| t2   | ID1 |   12 |   109 |
| t6   | ID1 |   20 |   108 |
+------+-----+------+-------+
+------+-----+------+-------+
|      | id  | size | price |
| time |     |      |       |
+------+-----+------+-------+
| t1   | ID2 |    9 |    97 |
| t3   | ID2 |   15 |    94 |
| t5   | ID2 |   13 |   100 |
+------+-----+------+-------+
I would like to create a single large dataframe with entries from both, and use ffill to forward-fill values from the previous time-step. I am able to achieve this using a combination of concat, sort_index and ffill. However, it requires renaming the columns of one of the dataframes first so that there aren't name clashes:
df2.columns = ['id', 'id2_size', 'id2_price']
df = pd.concat([df1, df2]).sort_index().ffill()
This results in the following dataframe:
+------+-----+------+-------+----------+-----------+
|      | id  | size | price | id2_size | id2_price |
| time |     |      |       |          |           |
+------+-----+------+-------+----------+-----------+
| t0   | ID1 |   10 |   110 |      nan |       nan |
| t1   | ID2 |   10 |   110 |        9 |        97 |
| t2   | ID1 |   12 |   109 |        9 |        97 |
| t3   | ID2 |   12 |   109 |       15 |        94 |
| t5   | ID2 |   12 |   109 |       13 |       100 |
| t6   | ID1 |   20 |   108 |       13 |       100 |
+------+-----+------+-------+----------+-----------+
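For concreteness, the rename-and-concat approach above can be reproduced with a short sketch (the CSV contents below are hypothetical stand-ins matching the tables, read with time as the index via index_col):

```python
import io
import pandas as pd

# Hypothetical CSV data matching the tables above; time is the index column
csv1 = io.StringIO("time,id,size,price\nt0,ID1,10,110\nt2,ID1,12,109\nt6,ID1,20,108\n")
csv2 = io.StringIO("time,id,size,price\nt1,ID2,9,97\nt3,ID2,15,94\nt5,ID2,13,100\n")
df1 = pd.read_csv(csv1, index_col='time')
df2 = pd.read_csv(csv2, index_col='time')

# Rename to avoid column clashes, then combine, sort by time, and forward-fill
df2.columns = ['id', 'id2_size', 'id2_price']
df = pd.concat([df1, df2]).sort_index().ffill()
print(df)
```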
My current method is fairly clunky in that I have to rename the columns of one of the dataframes. I believe a better way to represent the data would be to use a MultiIndex, with the second column level's values coming from the id column.
The resulting dataframe would look like this:
+------+------+-------+------+-------+
|      |     ID1      |     ID2      |
|      | size | price | size | price |
| time |      |       |      |       |
+------+------+-------+------+-------+
| t0   |   10 |   110 |  nan |   nan |
| t1   |   10 |   110 |    9 |    97 |
| t2   |   12 |   109 |    9 |    97 |
| t3   |   12 |   109 |   15 |    94 |
| t5   |   12 |   109 |   13 |   100 |
| t6   |   20 |   108 |   13 |   100 |
+------+------+-------+------+-------+
Is this possible?
If so, what steps would be required to go from the two dataframes read from CSV to the final merged, multiindexed dataframe?
Here's a one-liner that does what you ask, although it's a bit convoluted in terms of stacking/unstacking:
pd.concat([df1, df2]).set_index(['time', 'id']).sort_index().stack().unstack(level=[1, 2]).ffill()
id ID1 ID2
size price size price
time
t0 10 110 NaN NaN
t1 10 110 9 97
t2 12 109 9 97
t3 12 109 15 94
t5 12 109 13 100
t6 20 108 13 100
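A self-contained sketch of that one-liner (hypothetical data matching the question; this assumes the CSVs are read with time as a regular column, so it can be pushed into the index alongside id):

```python
import io
import pandas as pd

# Hypothetical sample data matching the question; time read as a regular column
csv1 = io.StringIO("time,id,size,price\nt0,ID1,10,110\nt2,ID1,12,109\nt6,ID1,20,108\n")
csv2 = io.StringIO("time,id,size,price\nt1,ID2,9,97\nt3,ID2,15,94\nt5,ID2,13,100\n")
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# Stack the value columns into the row index, then unstack both the id and
# value-name levels into a two-level column index, and forward-fill
df = (pd.concat([df1, df2])
        .set_index(['time', 'id'])
        .sort_index()
        .stack()
        .unstack(level=[1, 2])
        .ffill())
print(df)
```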
FWIW, my default approach would have been something like the following, which is a little more straightforward (less stacking/unstacking) and would give you the same basic results, but with a different column organization:
pd.concat([df1, df2]).set_index(['time', 'id']).sort_index().unstack().ffill()
size price
id ID1 ID2 ID1 ID2
time
t0 10 NaN 110 NaN
t1 10 9 110 97
t2 12 9 109 97
t3 12 15 109 94
t5 12 13 109 100
t6 20 13 108 100
And along those lines, you could then add swaplevel and sort_index to get the columns reorganized to be like in the first approach:
pd.concat([df1, df2]).set_index(['time', 'id']).sort_index().unstack().ffill().swaplevel(0, 1, axis=1).sort_index(axis=1)
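The swaplevel step can be sketched end-to-end like this (again with hypothetical data, and time read as a regular column). Note that sorting the column index alphabetically puts price before size within each id, so the column order is close to, but not identical to, the first approach:

```python
import io
import pandas as pd

# Hypothetical sample data matching the question; time read as a regular column
csv1 = io.StringIO("time,id,size,price\nt0,ID1,10,110\nt2,ID1,12,109\nt6,ID1,20,108\n")
csv2 = io.StringIO("time,id,size,price\nt1,ID2,9,97\nt3,ID2,15,94\nt5,ID2,13,100\n")
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# Unstack only the id level (columns become (value, id) pairs), forward-fill,
# then swap the column levels so id is on the outside and sort the columns
wide = (pd.concat([df1, df2])
          .set_index(['time', 'id'])
          .sort_index()
          .unstack()
          .ffill())
reordered = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(reordered)
```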