I have a DataFrame in which each row holds a list value.
id list_of_value
0 ['a','b','c']
1 ['d','b','c']
2 ['a','b','c']
3 ['a','b','c']
I have to calculate a score between one row and all the other rows.
For example:
Step 1: Take the value of id 0: ['a','b','c'].
Step 2: Find the intersection between id 0 and id 1:
resultant = ['b','c']
Step 3: Score calculation => resultant.size / id.size
Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and similarly for all the other ids,
and create an N x N DataFrame, such as this:
-   0     1     2     3
0   1     0.67  1     1
1   0.67  1     0.67  0.67
2   1     0.67  1     1
3   1     0.67  1     1
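The steps above can be sketched directly with Python sets (a minimal illustration on the sample data, not an optimized solution):

```python
# Sketch of steps 1-3 on the sample data shown above.
rows = {0: ['a', 'b', 'c'],
        1: ['d', 'b', 'c'],
        2: ['a', 'b', 'c'],
        3: ['a', 'b', 'c']}

scores = {}
for i, query in rows.items():
    # Step 2: intersect the query row with every row;
    # Step 3: divide the intersection size by the query row's size.
    scores[i] = [len(set(query) & set(other)) / len(query) for other in rows.values()]

print(scores[0])  # [1.0, 0.6666666666666666, 1.0, 1.0]
```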
Right now my code has just one for loop:
def scoreCalc(x, queryTData):
    # Mathematical calculation: intersection size / query size
    commonTData = np.intersect1d(np.array(x), queryTData)
    return commonTData.size / queryTData.size

ids = list(df['id'])
dfSim = pd.DataFrame()
for indexQFID in range(len(ids)):
    queryTData = np.array(df.loc[df['id'] == ids[indexQFID]]['list_of_value'].values.tolist())
    dfSim[ids[indexQFID]] = df['list_of_value'].apply(scoreCalc, args=(queryTData,))
Is there a better way to do this? Can I write a single apply function instead of a for-loop iteration? Can I make it faster?
If your data is not too big, you can use get_dummies to encode the values and do a matrix multiplication:
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
s.dot(s.T).div(s.sum(1))
Output:
0 1 2 3
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000
Update: Here's a short explanation of the code. The main idea is to turn the given lists into a one-hot-encoded table:
a b c d
0 1 1 1 0
1 0 1 1 1
2 1 1 1 0
3 1 1 1 0
Once we have that, the size of the intersection of two rows, say 0 and 1, is just their dot product, because a character belongs to both rows if and only if it is represented by a 1 in both.
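This dot-product fact is easy to check numerically (a tiny illustration with NumPy, using the two rows above):

```python
import numpy as np

# One-hot rows for id 0 (['a','b','c']) and id 1 (['d','b','c'])
# over the columns [a, b, c, d], as in the table above.
row0 = np.array([1, 1, 1, 0])
row1 = np.array([0, 1, 1, 1])

# Positions where both rows hold a 1 are exactly the shared characters,
# so the dot product equals the intersection size |{'b', 'c'}| = 2.
print(row0 @ row1)  # 2
```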
With that in mind, first use df.list_of_value.explode() to turn each cell into a series and concatenate all of those series. Output:
0 a
0 b
0 c
1 d
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
Name: list_of_value, dtype: object
Now, we use pd.get_dummies on that series to turn it into a one-hot-encoded dataframe:
a b c d
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
1 0 0 0 1
1 0 1 0 0
1 0 0 1 0
2 1 0 0 0
2 0 1 0 0
2 0 0 1 0
3 1 0 0 0
3 0 1 0 0
3 0 0 1 0
As you can see, each value has its own row. Since we want to combine the rows that belong to the same original row into one, we can sum them by the original index. Thus
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
gives the binary-encoded dataframe we want. The next line
s.dot(s.T).div(s.sum(1))
implements exactly your logic: s.dot(s.T) computes the pairwise dot products, i.e. the intersection sizes, and .div(s.sum(1)) divides each count by the row size.
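Putting the pieces together on the sample data, one cell of the result can be cross-checked against the set-based definition of the score (a small sanity check):

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3],
                   'list_of_value': [['a', 'b', 'c'], ['d', 'b', 'c'],
                                     ['a', 'b', 'c'], ['a', 'b', 'c']]})

# One-hot encode, sum back to one row per original index, then score.
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
result = s.dot(s.T).div(s.sum(1))

# Cross-check the (0, 1) cell against the set-based definition.
expected_01 = len(set(df.list_of_value[0]) & set(df.list_of_value[1])) / len(df.list_of_value[0])
print(result.loc[0, 1], expected_01)  # both 0.666...
```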
Try this
range_of_ids = range(len(ids))

def score_calculation(s_id1, s_id2):
    s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
    s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
    # Resultant calculation: s1 & s2
    return round(len(s1 & s2) / len(s1), 2)

dic = {indexQFID: [score_calculation(indexQFID, ind) for ind in range_of_ids] for indexQFID in range_of_ids}
dfSim = pd.DataFrame(dic)
print(dfSim)
Output
0 1 2 3
0 1.00 0.67 1.00 1.00
1 0.67 1.00 0.67 0.67
2 1.00 0.67 1.00 1.00
3 1.00 0.67 1.00 1.00
You can also do it as follows:
dic = {indexQFID: [round(len(set(s1)&set(s2))/len(s1) , 2) for s2 in df['list_of_value']] for indexQFID,s1 in zip(df['id'],df['list_of_value']) }
dfSim = pd.DataFrame(dic)
print(dfSim)
Use a nested list comprehension on the list of sets s_list. Within the list comprehension, use the intersection operation to check for overlap and take the length of each result. Finally, construct the dataframe and divide it by the length of each list in df.list_of_value:
s_list = df.list_of_value.map(set)
overlap = [[len(s1 & s) for s1 in s_list] for s in s_list]
df_final = pd.DataFrame(overlap) / df.list_of_value.str.len().to_numpy()[:,None]
Out[76]:
0 1 2 3
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000
In case there are duplicate values in each list, you should use collections.Counter instead of set. I changed the sample data for id=0 to ['a','a','c'] and id=1 to ['d','b','a'].
Sample df:
id list_of_value
0 ['a','a','c'] #changed
1 ['d','b','a'] #changed
2 ['a','b','c']
3 ['a','b','c']
from collections import Counter
c_list = df.list_of_value.map(Counter)
c_overlap = [[sum((c1 & c).values()) for c1 in c_list] for c in c_list]
df_final = pd.DataFrame(c_overlap) / df.list_of_value.str.len().to_numpy()[:,None]
Out[208]:
0 1 2 3
0 1.000000 0.333333 0.666667 0.666667
1 0.333333 1.000000 0.666667 0.666667
2 0.666667 0.666667 1.000000 1.000000
3 0.666667 0.666667 1.000000 1.000000
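The multiset intersection used above is `Counter & Counter`, which keeps the element-wise minimum of the counts; a quick illustration with the changed rows:

```python
from collections import Counter

c0 = Counter(['a', 'a', 'c'])   # id 0 after the change
c1 = Counter(['d', 'b', 'a'])   # id 1 after the change

# '&' keeps min(count in c0, count in c1) for each shared element.
overlap = c0 & c1
print(overlap, sum(overlap.values()))  # Counter({'a': 1}) 1  -> score 1/3 for the (0, 1) cell
```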
Updated
Since there are a lot of candidate solutions proposed, it seems like a good idea to do a timing analysis. I generated some random data with 12k rows as requested by the OP, keeping with the 3 elements per set but expanding the size of the alphabet available to populate the sets. This can be adjusted to match the actual data.
Let me know if you have a solution that you would like tested or updated.
Setup
import pandas as pd
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def random_letters(n, n_letters=52):
    return random.sample(ALPHABET[:n_letters], n)

# Create 12k rows to test scaling.
df = pd.DataFrame([{'id': i, 'list_of_value': random_letters(3)} for i in range(12000)])
Current Winner
def method_quang(df):
    s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
    return s.dot(s.T).div(s.sum(1))
%time method_quang(df)
# CPU times: user 10.5 s, sys: 828 ms, total: 11.3 s
# Wall time: 11.3 s
# ...
# [12000 rows x 12000 columns]
Contenders
def method_mcskinner(df):
    explode_df = df.set_index('id').list_of_value.explode().reset_index()
    explode_df = explode_df.rename(columns={'list_of_value': 'value'})
    denom_df = explode_df.groupby('id').size().reset_index(name='denom')
    numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y'])
    numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')
    calc_df = numer_df.merge(denom_df, on='id')
    calc_df['score'] = calc_df['numer'] / calc_df['denom']
    return calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)
%time method_mcskinner(df)
# CPU times: user 29.2 s, sys: 9.66 s, total: 38.9 s
# Wall time: 29.6 s
# ...
# [12000 rows x 12000 columns]
def method_rishab(df):
    vals = [[len(set(val1) & set(val2)) / len(val1) for val2 in df['list_of_value']] for val1 in df['list_of_value']]
    return pd.DataFrame(columns=df['id'], data=vals)
%time method_rishab(df)
# CPU times: user 2min 12s, sys: 4.64 s, total: 2min 17s
# Wall time: 2min 18s
# ...
# [12000 rows x 12000 columns]
def method_fahad(df):
    ids = list(df['id'])
    range_of_ids = range(len(ids))

    def score_calculation(s_id1, s_id2):
        s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
        s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
        # Resultant calculation: s1 & s2
        return round(len(s1 & s2) / len(s1), 2)

    dic = {indexQFID: [score_calculation(indexQFID, ind) for ind in range_of_ids] for indexQFID in range_of_ids}
    return pd.DataFrame(dic)
# Stopped manually after running for more than 10 minutes.
Original post with solution details
It is possible to do this in pandas with a self-join.
As other answers have pointed out, the first step is to unpack the data into a longer form.
explode_df = df.set_index('id').list_of_value.explode().reset_index()
explode_df = explode_df.rename(columns={'list_of_value': 'value'})
explode_df
# id value
# 0 0 a
# 1 0 b
# 2 0 c
# 3 1 d
# 4 1 b
# ...
From this table it is possible to compute the per-ID counts.
denom_df = explode_df.groupby('id').size().reset_index(name='denom')
denom_df
# id denom
# 0 0 3
# 1 1 3
# 2 2 3
# 3 3 3
And then comes the self-join, which happens on the value column. This pairs IDs once for each intersecting value, so the paired IDs can be counted to get the intersection sizes.
numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y'])
numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')
numer_df
# id id_y numer
# 0 0 0 3
# 1 0 1 2
# 2 0 2 3
# 3 0 3 3
# 4 1 0 2
# 5 1 1 3
# ...
These two can then be merged, and a score computed.
calc_df = numer_df.merge(denom_df, on='id')
calc_df['score'] = calc_df['numer'] / calc_df['denom']
calc_df
# id id_y numer denom score
# 0 0 0 3 3 1.000000
# 1 0 1 2 3 0.666667
# 2 0 2 3 3 1.000000
# 3 0 3 3 3 1.000000
# 4 1 0 2 3 0.666667
# 5 1 1 3 3 1.000000
# ...
If you prefer the matrix form, that is possible with a pivot. This will be a much larger representation if the data is sparse.
calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)
# id_y 0 1 2 3
# id
# 0 1.000000 0.666667 1.000000 1.000000
# 1 0.666667 1.000000 0.666667 0.666667
# 2 1.000000 0.666667 1.000000 1.000000
# 3 1.000000 0.666667 1.000000 1.000000
You can convert each list to a set and use the intersection function to check for overlap
(only 1 apply function is used, as you asked :-) ):
(
df.assign(s = df.list_of_value.apply(set))
.pipe(lambda x: pd.DataFrame([[len(e&f)/len(e) for f in x.s] for e in x.s]))
)
0 1 2 3
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000