I have two dataframes that both have a key column which may contain duplicates, and the dataframes mostly share the same duplicated keys. I'd like to merge the dataframes on that key, but in such a way that when both have the same duplicates, those duplicates are paired up one-to-one. In addition, if one dataframe has more duplicates of a key than the other, I'd like the missing values to be filled with NaN. For example:
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
                   columns=['key', 'A'])

df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
                    'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']},
                   columns=['key', 'B'])
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K2  A3
4  K2  A4
5  K3  A5
  key   B
0  K0  B0
1  K1  B1
2  K2  B2
3  K2  B3
4  K3  B4
5  K3  B5
6  K4  B6
I'm trying to get the following output:
   key    A    B
0   K0   A0   B0
1   K1   A1   B1
2   K2   A2   B2
3   K2   A3   B3
6   K2   A4  NaN
8   K3   A5   B4
9   K3  NaN   B5
10  K4  NaN   B6
So basically, I'd like to treat the duplicated K2 keys as K2_1, K2_2, ... and then do the how='outer' merge on the dataframes. Any ideas how I can accomplish this?
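For example, the renaming I have in mind would look something like this (a rough sketch using only the df1/df2 defined above; the helper column name key_seq is just for illustration):

seq1 = df1.groupby('key').cumcount().astype(str)
seq2 = df2.groupby('key').cumcount().astype(str)
left = df1.assign(key_seq=df1['key'] + '_' + seq1)   # K2, K2, K2 -> K2_0, K2_1, K2_2
right = df2.assign(key_seq=df2['key'] + '_' + seq2)
left.merge(right, on=['key', 'key_seq'], how='outer').drop(columns='key_seq')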
faster again
%%cython
# Cython cell in a Jupyter notebook
# (in another cell, first run `%load_ext Cython`)
from collections import defaultdict

def cg(x):
    # generator: yields the running occurrence count (1, 2, 3, ...) of each key
    cnt = defaultdict(lambda: 0)
    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]

def fastcount(x):
    return [i for i in cg(x)]

# in a regular (non-Cython) cell:
df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

# merge on both shared columns ('key' and 'cc'), then drop the helper column
df1.merge(df2, how='outer').drop(columns='cc')
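The cc column is just a running occurrence counter per key, so the outer merge pairs the first K2 in df1 with the first K2 in df2, the second with the second, and so on; unmatched occurrences come out as NaN. On the example data the counters are:

fastcount(df1.key.values)  # [1, 1, 1, 2, 3, 1]
fastcount(df2.key.values)  # [1, 1, 1, 2, 1, 2, 1]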
faster answer; not scalable (it builds an n_unique × n_rows boolean matrix, so memory blows up on large inputs)
import numpy as np

def fastcount(x):
    # broadcasted occurrence counter: m[i, j] is True where x[j] equals the i-th unique value
    unq, inv = np.unique(x, return_inverse=True)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop(columns='cc')
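To make the scalability caveat concrete, the intermediate mask has shape (n_unique, n_rows). On the tiny example it is negligible, e.g. for df2 (5 unique keys, 7 rows):

x = df2.key.values
unq, inv = np.unique(x, return_inverse=True)
(np.arange(len(unq))[:, None] == inv).shape  # (5, 7)

but for large frames with many distinct keys that mask becomes very large.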
old answer
# cumcount ranks each key's duplicates 0, 1, 2, ...
df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()

df1.merge(df2, how='outer').drop(columns='cc')
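On the example frames, all three variants produce the same eight rows (the merge builds a fresh RangeIndex, and with how='outer' the row order may differ from the illustration in the question):

  key    A    B
   K0   A0   B0
   K1   A1   B1
   K2   A2   B2
   K2   A3   B3
   K2   A4  NaN
   K3   A5   B4
   K3  NaN   B5
   K4  NaN   B6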