
Count instances of matched strings and cumulative total values

This is difficult to describe in a heading, but given these two DataFrames:

import pandas as pd
import numpy as np
import re


df1 = pd.DataFrame({
'url': [
  'http://google.com/car', 
  'http://google.com/moto', 
  'http://google.com/moto-bike'
], 'value': [3, 4, 6]})

url                           value
http://google.com/car         3
http://google.com/moto        4
http://google.com/moto-bike   6

df2 = pd.DataFrame({'name': ['car','moto','bus']})

  name
0 car
1 moto
2 bus

I want to count how many times each name in df2 appears in the url column of df1, and I have sort of managed it with:

df2['instances'] = pd.Series([df1.url.str.contains(fr'\D{w}\D', regex=True) \
.sum() for w in df2.name.tolist()])

For some reason car shows zero instances, even though it appears in one url.

   name  instances
0   car          0
1  moto          2
2   bus          0
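
One likely cause, sketched here as an assumption rather than a confirmed diagnosis: \D has to consume a real character on each side of the name, so a name sitting at the very end of a url has nothing left to match. The sketch below swaps in the zero-width word boundary \b (a different technique from the pattern above) and sums value through the same boolean masks; the names car_old and masks are introduced only for illustration.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'url': [
        'http://google.com/car',
        'http://google.com/moto',
        'http://google.com/moto-bike',
    ],
    'value': [3, 4, 6],
})
df2 = pd.DataFrame({'name': ['car', 'moto', 'bus']})

# \D must consume one character after 'car', but the url ends there,
# so the original pattern finds no match for it.
car_old = int(df1['url'].str.contains(r'\Dcar\D', regex=True).sum())

# \b is zero-width, so it also succeeds at the end of the string.
# re.escape guards against names containing regex metacharacters.
masks = [
    df1['url'].str.contains(rf'\b{re.escape(w)}\b', regex=True)
    for w in df2['name']
]

# One boolean mask per name: count the hits and sum the matching values.
df2['instances'] = [int(m.sum()) for m in masks]
df2['value_total'] = [int(df1.loc[m, 'value'].sum()) for m in masks]

print(df2)
```

This scans df1 once per name, which is fine for a short list of names; for a long list, a single extract followed by a groupby is likely to scale better.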

What I would like to be able to do is to have another column that sums the value column of all matches of df1, so it looks like this:

   name  instances  value_total
0   car          1           3
1  moto          2          10
2   bus          0           0

Any help in the right direction would be greatly appreciated, thanks!

Álvaro asked Dec 05 '25 15:12


2 Answers

Try str.extract, then merge and groupby with named aggregation (available in pandas 0.25+):

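# build one alternation pattern from the names: 'car|moto|bus'
pat = '|'.join(df2['name'])
# label each url with the first name it contains, then left-merge so
# names without any match (bus) are kept
m = df2.merge(df1.assign(name=df1['url']
            .str.extract('('+ pat + ')', expand=False)),on='name',how='left')
# count ignores the NaN value of unmatched names; sum treats it as 0
m = m.groupby('name',sort=False).agg(instances=('value','count')
                 ,value_total=('value','sum')).reset_index()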

print(m)

   name  instances  value_total
0   car          1          3.0
1  moto          2         10.0
2   bus          0          0.0
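
The value_total column comes out as float because the left merge introduces NaN for names with no match (bus here). A minimal sketch of the same pipeline with a cast back to integers at the end, assuming the value column holds whole numbers; the intermediate name labelled is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'url': [
        'http://google.com/car',
        'http://google.com/moto',
        'http://google.com/moto-bike',
    ],
    'value': [3, 4, 6],
})
df2 = pd.DataFrame({'name': ['car', 'moto', 'bus']})

pat = '|'.join(df2['name'])  # 'car|moto|bus'

# Label each url with the first name found in it (NaN when none is found).
labelled = df1.assign(name=df1['url'].str.extract(f'({pat})', expand=False))

m = (
    df2.merge(labelled, on='name', how='left')   # keeps 'bus' with NaN value
       .groupby('name', sort=False)
       .agg(instances=('value', 'count'), value_total=('value', 'sum'))
       .reset_index()
)

# After the sum there are no NaN values left, so the cast is safe.
m['value_total'] = m['value_total'].astype(int)

print(m)
```

Note that str.extract keeps only the first name found in each url, so a url containing two different names is credited to one of them only.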
anky answered Dec 07 '25 05:12


Here's a similar version of anky's answer using .loc, groupby and merge:

pat = '|'.join(df2['name'])
# contains() needs no capture group (pandas warns when it sees one); extract() does
df1.loc[df1['url'].str.contains(pat),'name'] = df1['url'].str.extract(f'({pat})')[0]

vals = (
    df1.groupby("name")
    .agg({"name": "count", "value": "sum"})
    .rename(columns={"name": "instance"})
    .reset_index()
)

new_df = pd.merge(df2,vals,on='name',how='left').fillna(0)

print(new_df)
   name  instance  value
0   car       1.0    3.0
1  moto       2.0   10.0
2   bus       0.0    0.0

Edit: if you need an exact match of car only, we can add word boundaries to that name (shown here on a different df1):

pat = r'|'.join(np.where(df2['name'].str.contains('car'),
                     r'\b' + df2['name'] + r'\b', df2['name']))
print(df1)
                          url  value 
0       http://google.com/car      3   
1     http://google.com/motor      4  
2  http://google.com/carousel      6  
3       http://google.com/bus      8 

df1.loc[df1['url'].str.contains(f'{pat}'),'name'] = df1['url'].str.extract(f'({pat})')[0]
print(df1)
                          url  value  name
0       http://google.com/car      3   car
1     http://google.com/motor      4  moto
2  http://google.com/carousel      6   NaN
3       http://google.com/bus      8   bus

If you want exact matches for all names, just add word boundaries to every part of the pattern:

pat = '|'.join(r'\b' + df2['name'] + r'\b')
#'\\bcar\\b|\\bmoto\\b|\\bbus\\b'
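
As a rough sketch of how the all-boundaries pattern behaves on the motor/carousel frame shown earlier: motor and carousel are no longer labelled, because a word character follows the name in both. The re.escape call is an extra precaution added here for names that might contain regex metacharacters; it is not part of the pattern above.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'url': [
        'http://google.com/car',
        'http://google.com/motor',
        'http://google.com/carousel',
        'http://google.com/bus',
    ],
    'value': [3, 4, 6, 8],
})
df2 = pd.DataFrame({'name': ['car', 'moto', 'bus']})

# Wrap every name in word boundaries so only whole-word hits count.
pat = '|'.join(rf'\b{re.escape(n)}\b' for n in df2['name'])

# expand=False returns a Series; rows without a whole-word match stay NaN.
df1['name'] = df1['url'].str.extract(f'({pat})', expand=False)

print(df1)
```

From here the same groupby and left merge as above can be applied unchanged.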
Umar.H answered Dec 07 '25 03:12