I have two dataframes:
data = {
'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1', 'KWS, KQM', 'JP, CVK', 'KF, GF']
}
df1 = pd.DataFrame(data)
df1
values gems
0 Cricket A1K, A2M, JA3, AN4
1 Soccer B1, A1, Bn2, B3
2 Football CD1, A1
3 Tennis KWS, KQM
4 Badminton JP, CVK
5 Chess KF, GF
second dataframe
data2 = {
'1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
'02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
'34C': ['KF', 'CD1', 'B3','ji', 'HU']
}
df2 = pd.DataFrame(data2)
df2
1C 02C 34C
0 B1 Bn2 KF
1 K1 B3 CD1
2 A1K JK B3
3 J1 ZZ ji
4 A4 ko HU
I want check items in df1['gems']
in each column of df2
and represent their counts and overlapping items. The expected output is:
values gems 1C 1CGroup 02C 02CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 1 A1K 0 NA 0 NA
1 Soccer B1, A1, Bn2, B3 1 Bn2 2 Bn2, B3 1 B3
2 Football CD1, A1 0 NA 0 NA 1 CD1
3 Tennis KWS, KQM 0 NA 0 NA 0 NA
4 Badminton JP, CVK 0 NA 0 NA 0 NA
5 Chess KF, GF 0 NA 0 NA 1 KF
By use + operator simply you can concatenate two or multiple text/string columns in pandas DataFrame.
One way to combine or concatenate DataFrames is concat() function. It can be used to concatenate DataFrames along rows or columns by changing the axis parameter. The default value of the axis parameter is 0, which indicates combining along rows.
We can easily create new columns based on other columns using the DataFrame’s withColumn () method. For example, if the column num is of type double, we can create a new column num_div_10 like so: But now, we want to set values for our new column based on certain conditions.
Using plot () method, specify a single column along X-axis and multiple columns as an array along Y-axis. Display graph. Below are few examples which illustrates the above approach to plot multiples data columns in a Dataframe.
You can apply the Pandas .map () method can be applied to a Pandas Series, meaning it can be applied to a Pandas DataFrame column. The map function is interesting because it can take three different shapes. This varies depending on what you pass into the method.
Mapping external value to a dataframe means using different sets of values to add in that dataframe by keeping the keys of external dictionary as same as the one column of that dataframe.
first str.split
and explode
the column gems and reset_index
to keep the original index. Then for each column of df2, merge
with the exploded gems, groupby
the original index and do both the count
and the aggregation as you want
with join
. pd.concat
the merges for each column and join to your original df1. fillna
the count columns with 0 as shown in the expected output.
# one row per gem used in the merge
df_ = df1['gems'].str.split(', ').explode().reset_index()
res = (
df1.join( #can join to df1 as we keep the original index value
pd.concat([df_.merge(df2[[col]], left_on='gems', right_on=col)
.groupby('index') # original index in df1
[col].agg(**{col: 'count', # do each aggregation
f'{col}Group':lambda x: ', '.join(x)})
for col in df2.columns], # do it for each column of df2
axis=1))
.fillna({col:0 for col in df2.columns}) #fill the count columns with 0
)
print(res)
values gems 1C 1CGroup 02C 02CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 1.0 A1K 0.0 NaN 0.0 NaN
1 Soccer B1, A1, Bn2, B3 1.0 B1 2.0 Bn2, B3 1.0 B3
2 Football CD1, A1 0.0 NaN 0.0 NaN 1.0 CD1
3 Tennis KWS, KQM 0.0 NaN 0.0 NaN 0.0 NaN
4 Badminton JP, CVK 0.0 NaN 0.0 NaN 0.0 NaN
5 Chess KF, GF 0.0 NaN 0.0 NaN 1.0 KF
findall
For each column in df2
, find all the occurrences of the column value in the gems column of df1
, then map
with len
to count the occurrences and optionally join
with str.join
for c in df2.columns:
s = df1['gems'].str.findall('|'.join(df2[c]))
df1[c] = s.map(len)
df1[c + 'group'] = s.str.join(', ')
print(df1)
values gems 1C 1Cgroup 02C 02Cgroup 34C 34Cgroup
0 Cricket A1K, A2M, JA3, AN4 1 A1K 0 0
1 Soccer B1, A1, Bn2, B3 1 B1 2 Bn2, B3 1 B3
2 Football CD1, A1 0 0 1 CD1
3 Tennis KWS, KQM 0 0 0
4 Badminton JP, CVK 0 0 0
5 Chess KF, GF 0 0 1 KF
First create a table of your groups:
df3 = (pd.merge(df1['gems'].str.split(',\s+').explode().reset_index(),
df2.unstack().reset_index(level=0),
left_on='gems', right_on=0, how='left'
)
.pivot_table(index='index',
columns=['level_0'],
values='gems',
aggfunc=list)
)
output:
level_0 02C 1C 34C
index
0 NaN [A1K] NaN
1 [Bn2, B3] [B1] [B3]
2 NaN NaN [CD1]
5 NaN NaN [KF]
Then produce the counts and concatenate everything with the original table:
pd.concat([df1,
pd.concat([df3.add_suffix('Group').applymap(lambda x: ','.join(x) if isinstance(x, list) else x),
df3.fillna('').applymap(len)],
axis=1).sort_index(axis=1)
], axis=1)
output:
values gems 02C 02CGroup 1C 1CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 0.0 NaN 1.0 A1K 0.0 NaN
1 Soccer B1, A1, Bn2, B3 2.0 Bn2, B3 1.0 B1 1.0 B3
2 Football CD1, A1 0.0 NaN 0.0 NaN 1.0 CD1
3 Tennis KWS, KQM NaN NaN NaN NaN NaN NaN
4 Badminton JP, CVK NaN NaN NaN NaN NaN NaN
5 Chess KF, GF 0.0 NaN 0.0 NaN 1.0 KF
edit: alternative for the string join and count
df3 = (pd.merge(df1['gems'].str.split(',\s+').explode().reset_index(),
df2.unstack().reset_index(level=0),
left_on='gems', right_on=0, how='left'
)
.pivot_table(index='index',
columns=['level_0'],
values='gems',
aggfunc=', '.join)
)
pd.concat([df1,
pd.concat([df3.add_suffix('Group'),
df3.applymap(lambda x: x.count(',')+1 if isinstance(x, str) else 0)],
axis=1).sort_index(axis=1)
], axis=1)
Worst solution using set
and apply
:
df1.gems = df1.gems.str.split(', ')
df3 = df2.T
ix = 0
def func(row):
global ix
d = {}
for idx, val in enumerate(df3.values):
v = list(set(row) & set(val))
d[df3.index[idx]] = ', '.join(v)
d[f"{df3.index[idx]}Group"] = len(v)
ix = ix + 1
return pd.Series(d)
res = pd.concat([df1,df1['gems'].apply(func)], axis=1)
Concise solution:
df1.gems = df1.gems.str.split(', ')
for col in df2.columns:
z = (zip(df1.gems, [df2[col].values] * len(df1)))
res = ([', '.join(list(set(a).intersection(b))) for a, b in z])
df1[col] = res
df1[f"{col}Group"] = (list(map(lambda x: len(x.split(', ')) if x!='' else 0, res)))
res:
values | gems | 1C | 1CGroup | 02C | 02CGroup | 34C | 34CGroup | |
---|---|---|---|---|---|---|---|---|
0 | Cricket | [A1K, A2M, JA3, AN4] | A1K | 1 | 0 | 0 | ||
1 | Soccer | [B1, A1, Bn2, B3] | B1 | 1 | B3, Bn2 | 2 | B3 | 1 |
2 | Football | [CD1, A1] | 0 | 0 | CD1 | 1 | ||
3 | Tennis | [KWS, KQM] | 0 | 0 | 0 | |||
4 | Badminton | [JP, CVK] | 0 | 0 | 0 | |||
5 | Chess | [KF, GF] | 0 | 0 | KF | 1 |
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With