I'm new to pandas. I have tried going through the docs and experimenting with various examples, but the problem I'm tackling has really stumped me.
I have the following two dataframes (DataA/DataB), which I would like to merge on a per global_index/item_id basis; they are shown below and reconstructed as code right after the tables.
DataA                      DataB
row  item_id  valueA       row    item_id  valueB
0    x        A1           0      x        B1
1    y        A2           1      y        B2
2    z        A3           2      x        B3
3    x        A4           3      y        B4
4    z        A5           4      z        B5
5    x        A6           5      x        B6
6    y        A7           6      y        B7
7    z        A8           7      z        B8
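For reference, here is one way to build the two frames above (a sketch; the names df_A and df_B match the answer code further down, and the 'row' column is kept as an explicit column, as in the tables):

import pandas as pd

df_A = pd.DataFrame({
    'row':     [0, 1, 2, 3, 4, 5, 6, 7],
    'item_id': ['x', 'y', 'z', 'x', 'z', 'x', 'y', 'z'],
    'valueA':  ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8'],
})

df_B = pd.DataFrame({
    'row':     [0, 1, 2, 3, 4, 5, 6, 7],
    'item_id': ['x', 'y', 'x', 'y', 'z', 'x', 'y', 'z'],
    'valueB':  ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8'],
})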
The list of items (item_ids) is finite, and each of the two dataframes represents the value of a trait (trait A, trait B) for an item at a given global_index value.
The global_index can roughly be thought of as a unit of "time".
The mapping between each data frame (DataA/DataB) and the global_index is done via the following two mapper DFs:
DataA_mapper
global_index  start_row  num_rows
0             0          3
1             3          2
3             5          3
DataB_mapper
global_index  start_row  num_rows
0             0          2
2             2          3
4             5          3
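The mappers can be built the same way (again a sketch, using the names map_A and map_B from the answer code below):

map_A = pd.DataFrame({
    'global_index': [0, 1, 3],
    'start_row':    [0, 3, 5],
    'num_rows':     [3, 2, 3],
})

map_B = pd.DataFrame({
    'global_index': [0, 2, 4],
    'start_row':    [0, 2, 5],
    'num_rows':     [2, 3, 3],
})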
Simply put, for a given global_index (e.g. 1) the mapper defines a range of rows in the respective DF (DataA or DataB) that is associated with that global_index.
For example, for a global_index value of 0: rows 0-2 of DataA (start_row 0, num_rows 3) and rows 0-1 of DataB (start_row 0, num_rows 2).
Another example, for a global_index value of 2: no rows of DataA (DataA_mapper has no entry for global_index 2) and rows 2-4 of DataB (start_row 2, num_rows 3).
The ranges [start_row, start_row + num_rows) do not overlap each other, and each represents a unique sequence/range of rows in its respective dataframe (DataA, DataB).
In short, no row in either DataA or DataB will be found in more than one range.
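To make the mapping concrete, here is a small helper that pulls out the rows covered by a given global_index (rows_for is a hypothetical name used only for illustration, and it assumes the sample frames built above, where 'row' matches the positional index):

def rows_for(df, mapper, gi):
    # look up the mapper entry for this global_index
    entry = mapper.loc[mapper['global_index'] == gi].iloc[0]
    start = int(entry['start_row'])
    # slice the half-open range [start_row, start_row + num_rows)
    return df.iloc[start:start + int(entry['num_rows'])]

rows_for(df_A, map_A, 0)   # DataA rows 0-2 -> A1, A2, A3
rows_for(df_B, map_B, 2)   # DataB rows 2-4 -> B3, B4, B5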
I would like to merge the DFs so that I get the following dataframe:
row   global_index  item_id   valueA   valueB
0     0             x         A1        B1
1     0             y         A2        B2
2     0             z         A3        NaN
3     1             x         A4        B1
4     1             z         A5        NaN
5     2             x         A4        B3
6     2             y         A2        B4
7     2             z         A5        NaN
8     3             x         A6        B3
9     3             y         A7        B4
10    3             z         A8        B5
11    4             x         A6        B6
12    4             y         A7        B7
13    4             z         A8        B8
In the final dataframe, for any given pair of global_index/item_id there will only ever be a single row, with either both values (valueA and valueB) present or only one of them.
The requirement is that if only one value exists for a given global_index/item_id (e.g. valueA but no valueB), the last known value of the missing trait should be carried forward.
First, you can create the 'global_index' column using the function pd.cut:
import numpy as np
import pandas as pd

for df, m in [(df_A, map_A), (df_B, map_B)]:
    # bin edges are the cumulative row counts, with a zero added at the beginning
    bins = np.insert(m['num_rows'].cumsum().values, 0, 0)
    # label each row of df with the global_index of the [start, end) bin it falls into
    df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'], right=False)
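As a quick sanity check on the sample data above, the assigned labels should line up with the mapper ranges:

print(df_A[['row', 'item_id', 'global_index']])
# rows 0-2 -> global_index 0, rows 3-4 -> 1, rows 5-7 -> 3

print(df_B[['row', 'item_id', 'global_index']])
# rows 0-1 -> global_index 0, rows 2-4 -> 2, rows 5-7 -> 4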
Next, you can use outer join to merge both data frames:
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')
And finally you can use functions groupby and ffill to fill missing values:
for val in ['valueA', 'valueB']:
    # forward-fill each trait's last known value within its item_id group
    df[val] = df.groupby('item_id')[val].ffill()
Output:
   item_id  global_index  valueA  valueB
0        x             0      A1      B1
1        y             0      A2      B2
2        z             0      A3     NaN
3        x             1      A4      B1
4        z             1      A5     NaN
5        x             3      A6      B1
6        y             3      A7      B2
7        z             3      A8     NaN
8        x             2      A6      B3
9        y             2      A7      B4
10       z             2      A8      B5
11       x             4      A6      B6
12       y             4      A7      B7
13       z             4      A8      B8