How to map two dataframes and output the overlapping items in new columns?

I have two dataframes:

import pandas as pd

data = {
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1', 'KWS, KQM', 'JP, CVK', 'KF, GF']
}
df1 = pd.DataFrame(data)

df1

    values       gems
0   Cricket      A1K, A2M, JA3, AN4
1   Soccer       B1, A1, Bn2, B3
2   Football     CD1, A1
3   Tennis       KWS, KQM
4   Badminton    JP, CVK
5   Chess        KF, GF

The second dataframe:

data2 = {
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3','ji', 'HU']
}
df2 = pd.DataFrame(data2)

df2

    1C  02C 34C
0   B1  Bn2 KF
1   K1  B3  CD1
2   A1K JK  B3
3   J1  ZZ  ji
4   A4  ko  HU

I want to check which items in df1['gems'] appear in each column of df2, and report their counts and the overlapping items. The expected output is:

    values    gems                  1C  1CGroup   02C   02CGroup    34C 34CGroup
0   Cricket   A1K, A2M, JA3, AN4    1   A1K       0     NA          0   NA
1   Soccer    B1, A1, Bn2, B3       1   Bn2       2     Bn2, B3     1   B3
2   Football  CD1, A1               0   NA        0     NA          1   CD1
3   Tennis    KWS, KQM              0   NA        0     NA          0   NA
4   Badminton JP, CVK               0   NA        0     NA          0   NA
5   Chess     KF, GF                0   NA        0     NA          1   KF
asked Jul 29 '21 by svp



4 Answers

First, str.split and explode the column gems, and reset_index to keep the original index. Then, for each column of df2, merge with the exploded gems, groupby the original index, and perform both the count and the string aggregation with join. pd.concat the merges for all the columns and join the result to the original df1. Finally, fillna the count columns with 0, as in the expected output.

# one row per gem, used in the merge
df_ = df1['gems'].str.split(', ').explode().reset_index()

res = (
    df1.join(  # can join to df1 as we keep the original index value
        pd.concat([df_.merge(df2[[col]], left_on='gems', right_on=col)
                      .groupby('index')  # original index in df1
                      [col].agg(**{col: 'count',  # do each aggregation
                                   f'{col}Group': lambda x: ', '.join(x)})
                   for col in df2.columns],  # do it for each column of df2
                  axis=1))
        .fillna({col: 0 for col in df2.columns})  # fill the count columns with 0
)
print(res)
      values                gems   1C 1CGroup  02C 02CGroup  34C 34CGroup
0    Cricket  A1K, A2M, JA3, AN4  1.0     A1K  0.0      NaN  0.0      NaN
1     Soccer     B1, A1, Bn2, B3  1.0      B1  2.0  Bn2, B3  1.0       B3
2   Football             CD1, A1  0.0     NaN  0.0      NaN  1.0      CD1
3     Tennis            KWS, KQM  0.0     NaN  0.0      NaN  0.0      NaN
4  Badminton             JP, CVK  0.0     NaN  0.0      NaN  0.0      NaN
5      Chess              KF, GF  0.0     NaN  0.0      NaN  1.0       KF
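
If you want the result to match the question exactly (integer counts and the literal string "NA" for empty groups), a small follow-up sketch on top of res (not part of the original answer) could be:

# cast the count columns from float to int, then replace the remaining NaN
# in the ...Group columns with the string 'NA' as in the expected output
res[df2.columns] = res[df2.columns].astype(int)
res = res.fillna('NA')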
answered Oct 12 '22 by Ben.T


Solution with findall

For each column in df2, use str.findall to find all occurrences of that column's values in the gems column of df1, then map with len to count the occurrences and, optionally, join the matches with str.join.

for c in df2.columns:
    s = df1['gems'].str.findall('|'.join(df2[c]))

    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')

print(df1)

      values                gems  1C 1Cgroup  02C 02Cgroup  34C 34Cgroup
0    Cricket  A1K, A2M, JA3, AN4   1     A1K    0             0         
1     Soccer     B1, A1, Bn2, B3   1      B1    2  Bn2, B3    1       B3
2   Football             CD1, A1   0            0             1      CD1
3     Tennis            KWS, KQM   0            0             0         
4  Badminton             JP, CVK   0            0             0         
5      Chess              KF, GF   0            0             1       KF
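
One caveat with the findall approach: '|'.join(df2[c]) builds a plain regex, so a short value such as A1 would also match inside a longer gem like A1K. That does not happen with this data, but a more defensive variant (a sketch, assuming the values should be treated literally) escapes each value and anchors it on word boundaries:

import re

for c in df2.columns:
    # escape each value and require word boundaries so 'A1' cannot match inside 'A1K'
    pattern = r'\b(?:' + '|'.join(map(re.escape, df2[c])) + r')\b'
    s = df1['gems'].str.findall(pattern)

    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')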
answered Oct 12 '22 by Shubham Sharma


First create a table of your groups:

df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
               )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=list)
      )

output:

level_0        02C     1C    34C
index                           
0              NaN  [A1K]    NaN
1        [Bn2, B3]   [B1]   [B3]
2              NaN    NaN  [CD1]
5              NaN    NaN   [KF]

Then produce the counts and concatenate everything with the original table:

pd.concat([df1,
           pd.concat([df3.add_suffix('Group').applymap(lambda x: ','.join(x) if isinstance(x, list) else x),
                      df3.fillna('').applymap(len)],
                     axis=1).sort_index(axis=1)
          ], axis=1)

output:

      values                gems  02C 02CGroup   1C 1CGroup  34C 34CGroup
0    Cricket  A1K, A2M, JA3, AN4  0.0      NaN  1.0     A1K  0.0      NaN
1     Soccer     B1, A1, Bn2, B3  2.0  Bn2, B3  1.0      B1  1.0       B3
2   Football             CD1, A1  0.0      NaN  0.0     NaN  1.0      CD1
3     Tennis            KWS, KQM  NaN      NaN  NaN     NaN  NaN      NaN
4  Badminton             JP, CVK  NaN      NaN  NaN     NaN  NaN      NaN
5      Chess              KF, GF  0.0      NaN  0.0     NaN  1.0       KF
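
Note that Tennis and Badminton have no match in any column, so they drop out of df3 entirely and come back as NaN (rather than 0) after the concat. If you want 0 counts for those rows too, one option (a small sketch on top of the code above, not part of the original answer) is to reindex df3 on df1's index before building the final frame:

# keep rows with no matches at all (Tennis, Badminton) as all-NaN rows in df3,
# so the fillna('')/applymap(len) step turns their counts into 0
df3 = df3.reindex(df1.index)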

Edit: an alternative for the string join and count:

df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
               )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=', '.join)
      )

pd.concat([df1,
           pd.concat([df3.add_suffix('Group'),
                      df3.applymap(lambda x: x.count(',')+1 if isinstance(x, str) else 0)],
                     axis=1).sort_index(axis=1)
          ], axis=1)
answered Oct 12 '22 by mozway


Worst solution using set and apply:

df1.gems = df1.gems.str.split(', ')

df3 = df2.T

def func(row):
    # intersect the row's gem list with each row of the transposed df2
    d = {}
    for idx, val in enumerate(df3.values):
        v = list(set(row) & set(val))
        d[df3.index[idx]] = ', '.join(v)
        d[f"{df3.index[idx]}Group"] = len(v)
    return pd.Series(d)

res = pd.concat([df1, df1['gems'].apply(func)], axis=1)

Concise solution:

df1.gems = df1.gems.str.split(', ')

for col in df2.columns:
    # pair each row's gem list with the column's values, then intersect
    z = zip(df1.gems, [df2[col].values] * len(df1))
    res = [', '.join(set(a).intersection(b)) for a, b in z]
    df1[col] = res
    df1[f"{col}Group"] = [len(x.split(', ')) if x != '' else 0 for x in res]

Output:

      values                  gems   1C  1CGroup      02C  02CGroup  34C  34CGroup
0    Cricket  [A1K, A2M, JA3, AN4]  A1K        1                  0              0
1     Soccer     [B1, A1, Bn2, B3]   B1        1  B3, Bn2         2   B3         1
2   Football             [CD1, A1]            0                  0  CD1         1
3     Tennis            [KWS, KQM]            0                  0              0
4  Badminton             [JP, CVK]            0                  0              0
5      Chess              [KF, GF]            0                  0   KF         1
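
In both variants the matched gems end up under the base columns (1C, 02C, 34C) and the counts under the ...Group columns, which is the reverse of the layout in the question. If you want the question's layout, a small rename sketch (not part of the original answer, shown here on df1 from the concise version) swaps the labels in one pass:

# build a mapping that swaps each base column with its ...Group counterpart;
# DataFrame.rename applies the mapping simultaneously, so the swap is safe
swap = {}
for col in df2.columns:
    swap[col] = f"{col}Group"
    swap[f"{col}Group"] = col

df1 = df1.rename(columns=swap)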
answered Oct 12 '22 by Pygirl