How to populate columns of a dataframe using a subset of another dataframe?

Tags:

I have two dataframes like this

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})

  key prop1 prop2
0   A     x     m
1   A     y     n
2   A     z     b
3   B     u     n
4   B     u     b
5   C     y     b
6   C     x     n
7   A     z     n
8   A     z     n
9   C     z     n

  key  prop1  prop2 keep_me
0   A    NaN    NaN   stuff
1   B    NaN    NaN   stuff
2   B    NaN    NaN   stuff
3   C    NaN    NaN   stuff
4   A    NaN    NaN   stuff
5   A    NaN    NaN   stuff

I now want to populate columns prop1 and prop2 in df2 using the values of df1. For each key, we will have more or equal rows in df1 than in df2 (in the example above: 5 times A vs 3 times A, 2 times B vs 2 times B and 3 times C vs 1 time C). For each key, I want to fill df2 using the first n rows per key from df1.

So, my expected outcome for df2 would be:

  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff

As key is not unique, I cannot simple build a dictionary and then use .map.

I was hoping that something along these lines would work:

pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')

but that fails with

ValueError: Shape of passed values is (5, 22), indices imply (5, 10)

as - I guess - the index contains non-unique values.

How can I get my desired output?

475

asked Sep 04 '19 11:09

Cleb

1 Answers

Because duplicates in key values possible solution is create new counter columns in both DataFrames by GroupBy.cumcount, so possible replace missing values from df2 with align by MultiIndex created by key and g columns with DataFrame.fillna:

df1['g'] = df1.groupby('key').cumcount()
df2['g'] = df2.groupby('key').cumcount()

print (df1)
  key prop1 prop2  g
0   A     x     m  0
1   A     y     n  1
2   A     z     b  2
3   B     u     n  0
4   B     u     b  1
5   C     y     b  0
6   C     x     n  1
7   A     z     n  3
8   A     z     n  4
9   C     z     n  2

print (df2)
  key  prop1  prop2 keep_me  g
0   A    NaN    NaN   stuff  0
1   B    NaN    NaN   stuff  0
2   B    NaN    NaN   stuff  1
3   C    NaN    NaN   stuff  0
4   A    NaN    NaN   stuff  1
5   A    NaN    NaN   stuff  2

df = (df2.set_index(['key','g'])
        .fillna(df1.set_index(['key','g']))
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff

169

answered Sep 27 '22 21:09

jezrael

Related questions
                            
                                How to rotate matplotlib bar-chart?
                            
                                Calling super().__init__() gives the wrong method when it is overridden
                            
                                Modify request data before view functions in Flask API
                            
                                Install OpenEXR for python failed
                            
                                how to add custom Keras model in OpenCv in python
                            
                                How to merge 2 ROS topics into 1?
                            
                                How to create a simple flag in Python
                            
                                Pip upgrade failed with ERROR: Invalid requirement: '\xe2\x80\x93upgrade'
                            
                                Using automatic differentiation libraries to compute partial derivatives of an arbitrary tensor
                            
                                how to start a task at a specific time with django & celery
                            
                                two for loops in one list comprehension separately [duplicate]
                            
                                Using different sample weights for each output in a multi-output Keras model
                            
                                How to generate WM_SEC.AUTH_SIGNATURE?
                            
                                How do I make something print if the input is a number is between 2 different numbers?
                            
                                How to parametrize Hypothesis strategy in @given
                            
                                How to deactivate venv while using fish?
                            
                                How To Annotate an Argument That Can Be Cast As Boolean?
                            
                                How to apply json_normalize on entire pandas column
                            
                                Can't import torch in jupyter notebook
                            
                                How to add prefix to multi index columns at particular level?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to populate columns of a dataframe using a subset of another dataframe?

Tags:

python

pandas

dataframe

Cleb

People also ask

1 Answers

jezrael

Recent Activity

Donate For Us