Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas: concatenate dataframes, forward-fill and multiindex on column data

Tags:

python

pandas

I have 2 csv files with the same column names, but different values.

The first column is the index (time) and one of the data columns is a unique identifier (id)

The index (time) is different for each csv file.

I have read the data into 2 dataframes using read_csv, giving me the following:

        +-------+------+-------+
        | id    | size | price |
+-------+-------+------+-------+
| time  |       |      |       |
+-------+-------+------+-------+
| t0    | ID1   | 10   | 110   |
| t2    | ID1   | 12   | 109   |
| t6    | ID1   | 20   | 108   |
+-------+-------+------+-------+

        +-------+------+-------+
        | id    | size | price |
+-------+-------+------+-------+
| time  |       |      |       |
+-------+-------+------+-------+
| t1    | ID2   |  9   |  97   |
| t3    | ID2   | 15   |  94   |
| t5    | ID2   | 13   | 100   |
+-------+-------+------+-------+

I would like to create a single large dataframe with entries for both, and use ffill to forward fill values from the previous time-step.

I am able to achieve this using a combination of concat, sort and ffill.

However, it requires renaming the columns of one of the dataframes first, so that there aren't name clashes

df2.columns = [ 'id', 'id2_size', 'id2_price' ]
df = pd.concat([df1, df2]).sort().ffill()

This results in the following dataframe:

        +------+------+-------+----------+-----------+
        | id   | size | price | id2_size | id2_price |
+-------+------+------+-------+----------+-----------+
| time  |      |      |       |          |           |
+-------+------+------+-------+----------+-----------+
| t0    | ID1  | 10   | 110   |     nan  |     nan   |
| t1    | ID2  | 10   | 110   |      9   |      97   |
| t2    | ID1  | 12   | 109   |      9   |      97   |
| t3    | ID2  | 12   | 109   |     15   |      94   |
| t5    | ID2  | 12   | 109   |     13   |     100   |
| t6    | ID1  | 20   | 108   |     13   |     100   |
+-------+------+------+-------+----------+-----------+

My current method is fairly klunky in that I have to rename the columns of one of the dataframes.

I believe a better way to represent the data would be use a multiindex with the 2nd dimension's value coming from the id column.

The resulting dataframe would look like this:

        +--------------+--------------+
        | ID1          | ID2          |
        +------+-------+------+-------+
        | size | price | size | price |
+-------+------+-------+------+-------+
| time  |      |       |      |       |
+-------+------+-------+------+-------+
| t0    | 10   | 110   | nan  | nan   |
| t1    | 10   | 110   |  9   |  97   |
| t2    | 12   | 109   |  9   |  97   |
| t3    | 12   | 109   | 15   |  94   |
| t5    | 12   | 109   | 13   | 100   |
| t6    | 20   | 108   | 13   | 100   |
+-------+------+-------+------+-------+

Is this possible?
If so, what steps would be required to go from the 2 dataframes read from csv, to the final merged multiindexed dataframe?

like image 932
Steve Lorimer Avatar asked Nov 09 '22 13:11

Steve Lorimer


1 Answers

Here's a one-liner that does what you ask, although it's a bit convoluted in terms of stacking/unstacking:

df1.append(df2).set_index(['time','id']).sort().stack().unstack(level=[1,2]).ffill()

id    ID1        ID2      
     size price size price
time                      
t0     10   110  NaN   NaN
t1     10   110    9    97
t2     12   109    9    97
t3     12   109   15    94
t5     12   109   13   100
t6     20   108   13   100

FWIW, my default approach would have been something like the following, which is a little more straightforward (less stacking/unstacking) and would give you the same basic results, but with a different column organization:

df1.append(df2).set_index(['time','id']).sort().unstack().ffill()

     size     price     
id    ID1 ID2   ID1  ID2
time                    
t0     10 NaN   110  NaN
t1     10   9   110   97
t2     12   9   109   97
t3     12  15   109   94
t5     12  13   109  100
t6     20  13   108  100

And along those lines, you could then add swaplevel and sort to get the columns reorganized to be like in the first approach:

df1.append(df2).set_index(['time','id']).sort().unstack().ffill().swaplevel(0,1,axis=1).sort(axis=1)
like image 193
JohnE Avatar answered Nov 14 '22 21:11

JohnE