I'm new to pandas have tried going through the docs and experiment with various examples, but this problem I'm tacking has really stumped me.
I have the following two dataframes (DataA/DataB) which I would like to merge on a per global_index/item/values basis.
DataA DataB
row item_id valueA row item_id valueB
0 x A1 0 x B1
1 y A2 1 y B2
2 z A3 2 x B3
3 x A4 3 y B4
4 z A5 4 z B5
5 x A6 5 x B6
6 y A7 6 y B7
7 z A8 7 z B8
The list of items(item_ids) is finite and each of the two dataframes represent a the value of a trait (trait A, trait B) for an item at a given global_index value.
The global_index could roughly be thought of as a unit of "time"
The mapping between each data frame (DataA/DataB) and the global_index is done via the following two mapper DFs:
DataA_mapper
global_index start_row num_rows
0 0 3
1 3 2
3 5 3
DataB_mapper
global_index start_row num_rows
0 0 2
2 2 3
4 5 3
Simply put for a given global_index (eg: 1) the mapper will define a list of rows into the respective DFs (DataA or DataB) that are associated with that global_index.
For example, for a global_index value of 0:
Another example, for a global_index value of 2:
The ranges [start_row,start_row + num_rows) presented do not overlap each other and represent a unique sequence/range of rows in their respective dataframes (DataA, DataB)
In short no row in either DataA or DataB will be found in more than one range.
I would like to merge the DFs so that I get the following dataframe:
row global_index item_id valueA valueB
0 0 x A1 B1
1 0 y A2 B2
2 0 z A3 NaN
3 1 x A4 B1
4 1 z A5 NaN
5 2 x A4 B3
6 2 y A2 B4
7 2 z A5 NaN
8 3 x A6 B3
9 3 y A7 B4
10 3 z A8 B5
11 4 x A6 B6
12 4 y A7 B7
13 4 z A8 B8
In the final datafram any pair of global_index/item_id there will ever be either:
With the requirement being if there is only one value for a given global_index/item (eg: valueA but no valueB) for the last value of the missing one to be used.
First, you can create the 'global_index'
column using the function pd.cut
:
for df, m in [(df_A, map_A), (df_B, map_B)]:
bins = np.insert(m['num_rows'].cumsum().values, 0, 0) # create bins and add zero at the beginning
df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'], right=False)
Next, you can use outer join to merge
both data frames:
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')
And finally you can use functions groupby
and ffill
to fill missing values:
for val in ['valueA', 'valueB']:
df[val] = df.groupby('item_id')[val].ffill()
Output:
item_id global_index valueA valueB
0 x 0 A1 B1
1 y 0 A2 B2
2 z 0 A3 NaN
3 x 1 A4 B1
4 z 1 A5 NaN
5 x 3 A6 B1
6 y 3 A7 B2
7 z 3 A8 NaN
8 x 2 A6 B3
9 y 2 A7 B4
10 z 2 A8 B5
11 x 4 A6 B6
12 y 4 A7 B7
13 z 4 A8 B8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With