So I have a dataframe (or series) in which each value in column 'A' always occurs exactly 4 times, like this:
df = pd.DataFrame([['foo'],
['foo'],
['foo'],
['foo'],
['bar'],
['bar'],
['bar'],
['bar']],
columns=['A'])
A
0 foo
1 foo
2 foo
3 foo
4 bar
5 bar
6 bar
7 bar
I also have another dataframe whose column A holds the same values, but each value doesn't always occur 4 times. It also has more columns, like this:
df_key = pd.DataFrame([['foo', 1, 2],
['foo', 3, 4],
['bar', 5, 9],
['bar', 2, 4],
['bar', 1, 9]],
columns=['A', 'B', 'C'])
A B C
0 foo 1 2
1 foo 3 4
2 bar 5 9
3 bar 2 4
4 bar 1 9
I want to merge them so that they end up like this, using something like:
df.merge(df_key, how='left', on='A', copy=False)
A B C
0 foo 1 2
1 foo 3 4
2 foo NaN NaN
3 foo NaN NaN
4 bar 5 9
5 bar 2 4
6 bar 1 9
7 bar NaN NaN
But instead I end up with something like this. Any advice?
A B C
0 foo 1 2
1 foo 3 4
2 foo 1 2
3 foo 3 4
4 foo 1 2
5 foo 3 4
6 foo 1 2
7 foo 3 4
8 bar 5 9
9 bar 2 4
10 bar 1 9
11 bar 5 9
12 bar 2 4
13 bar 1 9
14 bar 5 9
15 bar 2 4
16 bar 1 9
17 bar 5 9
18 bar 2 4
19 bar 1 9
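The extra rows appear because merging on A alone is a many-to-many join: every 'foo' row in df is paired with every 'foo' row in df_key, and likewise for 'bar'. A minimal sketch that checks the row counts, assuming the two frames from the question (the names merged and expected are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo'] * 4 + ['bar'] * 4})
df_key = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                       'B': [1, 3, 5, 2, 1],
                       'C': [2, 4, 9, 4, 9]})

# Plain left merge on 'A': each left row matches every right row with the same key.
merged = df.merge(df_key, how='left', on='A')

# Rows per key after the merge = (rows in df) * (rows in df_key) for that key.
expected = df['A'].value_counts() * df_key['A'].value_counts()

print(merged['A'].value_counts().sort_index())
print(expected.sort_index())
```

So 'foo' contributes 4 * 2 rows and 'bar' contributes 4 * 3 rows, which gives the 20 rows shown above.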
You'll need to create surrogate columns with groupby + cumcount to deduplicate your rows, then include those columns when calling merge:
a = df.assign(D=df.groupby('A').cumcount())
b = df_key.assign(D=df_key.groupby('A').cumcount())
a.merge(b, on=['A', 'D'], how='left').drop(columns='D')
A B C
0 foo 1.0 2.0
1 foo 3.0 4.0
2 foo NaN NaN
3 foo NaN NaN
4 bar 5.0 9.0
5 bar 2.0 4.0
6 bar 1.0 9.0
7 bar NaN NaN
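For reference, here is the same idea as a self-contained sketch. The helper name merge_by_occurrence and the temporary column name '_occ' are made up for illustration and are not part of pandas; it assumes neither frame already has a column with that name.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo'] * 4 + ['bar'] * 4})
df_key = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                       'B': [1, 3, 5, 2, 1],
                       'C': [2, 4, 9, 4, 9]})

def merge_by_occurrence(left, right, key, tmp='_occ'):
    """Left-merge on `key`, pairing the n-th occurrence of each key value
    in `left` with the n-th occurrence in `right` (hypothetical helper)."""
    # cumcount numbers the rows within each group: 0, 1, 2, ...
    l = left.assign(**{tmp: left.groupby(key).cumcount()})
    r = right.assign(**{tmp: right.groupby(key).cumcount()})
    # (key, occurrence number) is unique on both sides, so nothing is duplicated.
    return l.merge(r, on=[key, tmp], how='left').drop(columns=tmp)

result = merge_by_occurrence(df, df_key, 'A')
print(result)
```

B and C come back as floats because the unmatched rows are filled with NaN.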
Or you can just pad df_key with extra rows, repeating each value of column A as many times as it is still missing compared to df:
s = df.A.value_counts() - df_key.A.value_counts()
pd.concat([df_key, pd.DataFrame({'A': s.index.repeat(s)})]).sort_values('A')
Out[469]:
A B C
2 bar 5.0 9.0
3 bar 2.0 4.0
4 bar 1.0 9.0
0 bar NaN NaN
0 foo 1.0 2.0
1 foo 3.0 4.0
1 foo NaN NaN
2 foo NaN NaN
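A slightly more defensive sketch of this padding approach, under a few assumptions of mine: fill_value=0 covers keys that exist in df but not in df_key, a stable sort keeps the filled rows ahead of the NaN rows within each key, and ignore_index gives a clean 0..n-1 index. The names missing and padding are only for illustration. It still assumes df_key never has more rows for a key than df (a negative count would make repeat raise).

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo'] * 4 + ['bar'] * 4})
df_key = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                       'B': [1, 3, 5, 2, 1],
                       'C': [2, 4, 9, 4, 9]})

# How many rows each key is still short of, relative to df.
missing = (df['A'].value_counts()
           .sub(df_key['A'].value_counts(), fill_value=0)
           .astype(int))

# One row per missing occurrence; B and C become NaN after concat.
padding = pd.DataFrame({'A': missing.index.repeat(missing.to_numpy())})

result = (pd.concat([df_key, padding], ignore_index=True)
          .sort_values('A', kind='mergesort', ignore_index=True))
print(result)
```

Note that the rows come out sorted by A ('bar' before 'foo'), not in the order the keys appear in df, so this only matches the desired output up to row order.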