How to map two dataframes and output the overlapping items in new columns?

I have two dataframes:

import pandas as pd

data = {
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1', 'KWS, KQM', 'JP, CVK', 'KF, GF']
}
df1 = pd.DataFrame(data)

df1

    values       gems
0   Cricket      A1K, A2M, JA3, AN4
1   Soccer       B1, A1, Bn2, B3
2   Football     CD1, A1
3   Tennis       KWS, KQM
4   Badminton    JP, CVK
5   Chess        KF, GF

The second dataframe:

data2 = {
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3','ji', 'HU']
}
df2 = pd.DataFrame(data2)

df2

    1C  02C 34C
0   B1  Bn2 KF
1   K1  B3  CD1
2   A1K JK  B3
3   J1  ZZ  ji
4   A4  ko  HU

I want to check which items in df1['gems'] appear in each column of df2, and report their counts and the overlapping items. The expected output is:

    values    gems                  1C  1CGroup   02C   02CGroup    34C 34CGroup
0   Cricket   A1K, A2M, JA3, AN4    1   A1K       0     NA          0   NA
1   Soccer    B1, A1, Bn2, B3       1   Bn2       2     Bn2, B3     1   B3
2   Football  CD1, A1               0   NA        0     NA          1   CD1
3   Tennis    KWS, KQM              0   NA        0     NA          0   NA
4   Badminton JP, CVK               0   NA        0     NA          0   NA
5   Chess     KF, GF                0   NA        0     NA          1   KF
asked Jul 29 '21 by svp



4 Answers

First, str.split and explode the column gems, and reset_index to keep the original index. Then, for each column of df2, merge with the exploded gems, groupby the original index, and perform both the count and the string aggregation with join. pd.concat the merges for all the columns and join the result to the original df1. Finally, fillna the count columns with 0, as in the expected output.

# one row per gem, used in the merge
df_ = df1['gems'].str.split(', ').explode().reset_index()

res = (
    df1.join(  # can join to df1 as we keep the original index value
        pd.concat([df_.merge(df2[[col]], left_on='gems', right_on=col)
                      .groupby('index')  # original index in df1
                      [col].agg(**{col: 'count',  # do each aggregation
                                   f'{col}Group': lambda x: ', '.join(x)})
                   for col in df2.columns],  # do it for each column of df2
                  axis=1))
        .fillna({col: 0 for col in df2.columns})  # fill the count columns with 0
)
print(res)
      values                gems   1C 1CGroup  02C 02CGroup  34C 34CGroup
0    Cricket  A1K, A2M, JA3, AN4  1.0     A1K  0.0      NaN  0.0      NaN
1     Soccer     B1, A1, Bn2, B3  1.0      B1  2.0  Bn2, B3  1.0       B3
2   Football             CD1, A1  0.0     NaN  0.0      NaN  1.0      CD1
3     Tennis            KWS, KQM  0.0     NaN  0.0      NaN  0.0      NaN
4  Badminton             JP, CVK  0.0     NaN  0.0      NaN  0.0      NaN
5      Chess              KF, GF  0.0     NaN  0.0      NaN  1.0       KF
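
If you want the result to match the question exactly (integer counts and the literal string "NA" for empty groups), a small follow-up sketch on top of res (not part of the original answer) could be:

# cast the count columns from float to int, then replace the remaining NaN
# in the ...Group columns with the string 'NA' as in the expected output
res[df2.columns] = res[df2.columns].astype(int)
res = res.fillna('NA')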
answered Oct 12 '22 by Ben.T


Solution with findall

For each column in df2, use str.findall to find all occurrences of that column's values in the gems column of df1, then map with len to count the occurrences and, optionally, join the matches with str.join.

for c in df2.columns:
    s = df1['gems'].str.findall('|'.join(df2[c]))

    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')

print(df1)

      values                gems  1C 1Cgroup  02C 02Cgroup  34C 34Cgroup
0    Cricket  A1K, A2M, JA3, AN4   1     A1K    0             0         
1     Soccer     B1, A1, Bn2, B3   1      B1    2  Bn2, B3    1       B3
2   Football             CD1, A1   0            0             1      CD1
3     Tennis            KWS, KQM   0            0             0         
4  Badminton             JP, CVK   0            0             0         
5      Chess              KF, GF   0            0             1       KF
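
One caveat with the findall approach: '|'.join(df2[c]) builds a plain regex, so a short value such as A1 would also match inside a longer gem like A1K. That does not happen with this data, but a more defensive variant (a sketch, assuming the values should be treated literally) escapes each value and anchors it on word boundaries:

import re

for c in df2.columns:
    # escape each value and require word boundaries so 'A1' cannot match inside 'A1K'
    pattern = r'\b(?:' + '|'.join(map(re.escape, df2[c])) + r')\b'
    s = df1['gems'].str.findall(pattern)

    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')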
answered Oct 12 '22 by Shubham Sharma


First create a table of your groups:

df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
               )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=list)
      )

output:

level_0        02C     1C    34C
index                           
0              NaN  [A1K]    NaN
1        [Bn2, B3]   [B1]   [B3]
2              NaN    NaN  [CD1]
5              NaN    NaN   [KF]

Then produce the counts and concatenate everything with the original table:

pd.concat([df1,
           pd.concat([df3.add_suffix('Group').applymap(lambda x: ','.join(x) if isinstance(x, list) else x),
                      df3.fillna('').applymap(len)],
                     axis=1).sort_index(axis=1)
          ], axis=1)

output:

      values                gems  02C 02CGroup   1C 1CGroup  34C 34CGroup
0    Cricket  A1K, A2M, JA3, AN4  0.0      NaN  1.0     A1K  0.0      NaN
1     Soccer     B1, A1, Bn2, B3  2.0  Bn2, B3  1.0      B1  1.0       B3
2   Football             CD1, A1  0.0      NaN  0.0     NaN  1.0      CD1
3     Tennis            KWS, KQM  NaN      NaN  NaN     NaN  NaN      NaN
4  Badminton             JP, CVK  NaN      NaN  NaN     NaN  NaN      NaN
5      Chess              KF, GF  0.0      NaN  0.0     NaN  1.0       KF
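
Note that Tennis and Badminton have no match in any column, so they drop out of df3 entirely and come back as NaN (rather than 0) after the concat. If you want 0 counts for those rows too, one option (a small sketch on top of the code above, not part of the original answer) is to reindex df3 on df1's index before building the final frame:

# keep rows with no matches at all (Tennis, Badminton) as all-NaN rows in df3,
# so the fillna('')/applymap(len) step turns their counts into 0
df3 = df3.reindex(df1.index)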

Edit: an alternative for the string join and count:

df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
               )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=', '.join)
      )

pd.concat([df1,
           pd.concat([df3.add_suffix('Group'),
                      df3.applymap(lambda x: x.count(',')+1 if isinstance(x, str) else 0)],
                     axis=1).sort_index(axis=1)
          ], axis=1)
answered Oct 12 '22 by mozway


Worst solution using set and apply:

df1.gems = df1.gems.str.split(', ')

df3 = df2.T

def func(row):
    # intersect the row's gem list with each row of the transposed df2
    d = {}
    for idx, val in enumerate(df3.values):
        v = list(set(row) & set(val))
        d[df3.index[idx]] = ', '.join(v)
        d[f"{df3.index[idx]}Group"] = len(v)
    return pd.Series(d)

res = pd.concat([df1, df1['gems'].apply(func)], axis=1)

Concise solution:

df1.gems = df1.gems.str.split(', ')

for col in df2.columns:
    # pair each row's gem list with the column's values, then intersect
    z = zip(df1.gems, [df2[col].values] * len(df1))
    res = [', '.join(set(a).intersection(b)) for a, b in z]
    df1[col] = res
    df1[f"{col}Group"] = [len(x.split(', ')) if x != '' else 0 for x in res]

Output:

      values                  gems   1C  1CGroup      02C  02CGroup  34C  34CGroup
0    Cricket  [A1K, A2M, JA3, AN4]  A1K        1                  0              0
1     Soccer     [B1, A1, Bn2, B3]   B1        1  B3, Bn2         2   B3         1
2   Football             [CD1, A1]            0                  0  CD1         1
3     Tennis            [KWS, KQM]            0                  0              0
4  Badminton             [JP, CVK]            0                  0              0
5      Chess              [KF, GF]            0                  0   KF         1
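
In both variants the matched gems end up under the base columns (1C, 02C, 34C) and the counts under the ...Group columns, which is the reverse of the layout in the question. If you want the question's layout, a small rename sketch (not part of the original answer, shown here on df1 from the concise version) swaps the labels in one pass:

# build a mapping that swaps each base column with its ...Group counterpart;
# DataFrame.rename applies the mapping simultaneously, so the swap is safe
swap = {}
for col in df2.columns:
    swap[col] = f"{col}Group"
    swap[f"{col}Group"] = col

df1 = df1.rename(columns=swap)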
answered Oct 12 '22 by Pygirl