I have a DataFrame in which each row holds a list value.
id list_of_value
0 ['a','b','c']
1 ['d','b','c']
2 ['a','b','c']
3 ['a','b','c']
I have to calculate a score between one row and all the other rows.
For example:
Step 1: Take the value of id 0: ['a','b','c'].
Step 2: Find the intersection between id 0 and id 1:
resultant = ['b','c']
Step 3: Score calculation => resultant.size / id.size
Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and similarly for all the other ids,
and create an N x N DataFrame, such as this:
-   0     1     2     3
0   1     0.67  1     1
1   0.67  1     0.67  0.67
2   1     0.67  1     1
3   1     0.67  1     1
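The steps above can be sketched directly with Python sets (a minimal illustration on the sample data, not an optimized solution):

```python
# Sketch of steps 1-3 on the sample data shown above.
rows = {0: ['a', 'b', 'c'],
        1: ['d', 'b', 'c'],
        2: ['a', 'b', 'c'],
        3: ['a', 'b', 'c']}

scores = {}
for i, query in rows.items():
    # Step 2: intersect the query row with every row;
    # Step 3: divide the intersection size by the query row's size.
    scores[i] = [len(set(query) & set(other)) / len(query) for other in rows.values()]

print(scores[0])  # [1.0, 0.6666666666666666, 1.0, 1.0]
```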
Right now my code has just one for loop:
def scoreCalc(x, queryTData):
    # Mathematical calculation: intersection size / query size
    commonTData = np.intersect1d(np.array(x), queryTData)
    return commonTData.size / queryTData.size

ids = list(df['id'])
dfSim = pd.DataFrame()
for indexQFID in range(len(ids)):
    queryTData = np.array(df.loc[df['id'] == ids[indexQFID]]['list_of_value'].values.tolist())
    dfSim[ids[indexQFID]] = df['list_of_value'].apply(scoreCalc, args=(queryTData,))
Is there a better way to do this? Can I write a single apply function instead of a for-loop iteration? Can I make it faster?
If your data is not too big, you can use get_dummies to encode the values and do a matrix multiplication:
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
s.dot(s.T).div(s.sum(1))
Output:
0 1 2 3
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000
Update: Here's a short explanation of the code. The main idea is to turn the given lists into a one-hot-encoded table:
a b c d
0 1 1 1 0
1 0 1 1 1
2 1 1 1 0
3 1 1 1 0
Once we have that, the size of the intersection of two rows, say 0 and 1, is just their dot product, because a character belongs to both rows if and only if it is represented by a 1 in both.
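This dot-product fact is easy to check numerically (a tiny illustration with NumPy, using the two rows above):

```python
import numpy as np

# One-hot rows for id 0 (['a','b','c']) and id 1 (['d','b','c'])
# over the columns [a, b, c, d], as in the table above.
row0 = np.array([1, 1, 1, 0])
row1 = np.array([0, 1, 1, 1])

# Positions where both rows hold a 1 are exactly the shared characters,
# so the dot product equals the intersection size |{'b', 'c'}| = 2.
print(row0 @ row1)  # 2
```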
With that in mind, first use df.list_of_value.explode() to turn each cell into a series and concatenate all of those series. Output:
0 a
0 b
0 c
1 d
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
Name: list_of_value, dtype: object
Now, we use pd.get_dummies on that series to turn it into a one-hot-encoded dataframe:
a b c d
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
1 0 0 0 1
1 0 1 0 0
1 0 0 1 0
2 1 0 0 0
2 0 1 0 0
2 0 0 1 0
3 1 0 0 0
3 0 1 0 0
3 0 0 1 0
As you can see, each value has its own row. Since we want to combine the rows that belong to the same original row into one, we can sum them by the original index. Thus
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
gives the binary-encoded dataframe we want. The next line
s.dot(s.T).div(s.sum(1))
implements exactly your logic: s.dot(s.T) computes the pairwise dot products, i.e. the intersection sizes, and .div(s.sum(1)) divides each count by the row size.
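Putting the pieces together on the sample data, one cell of the result can be cross-checked against the set-based definition of the score (a small sanity check):

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3],
                   'list_of_value': [['a', 'b', 'c'], ['d', 'b', 'c'],
                                     ['a', 'b', 'c'], ['a', 'b', 'c']]})

# One-hot encode, sum back to one row per original index, then score.
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
result = s.dot(s.T).div(s.sum(1))

# Cross-check the (0, 1) cell against the set-based definition.
expected_01 = len(set(df.list_of_value[0]) & set(df.list_of_value[1])) / len(df.list_of_value[0])
print(result.loc[0, 1], expected_01)  # both 0.666...
```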
Try this
range_of_ids = range(len(ids))

def score_calculation(s_id1, s_id2):
    s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
    s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
    # Resultant calculation: s1 & s2
    return round(len(s1 & s2) / len(s1), 2)

dic = {indexQFID: [score_calculation(indexQFID, ind) for ind in range_of_ids] for indexQFID in range_of_ids}
dfSim = pd.DataFrame(dic)
print(dfSim)
Output
0 1 2 3
0 1.00 0.67 1.00 1.00
1 0.67 1.00 0.67 0.67
2 1.00 0.67 1.00 1.00
3 1.00 0.67 1.00 1.00
You can also do it as follows:
dic = {indexQFID: [round(len(set(s1)&set(s2))/len(s1) , 2) for s2 in df['list_of_value']] for indexQFID,s1 in zip(df['id'],df['list_of_value']) }
dfSim = pd.DataFrame(dic)
print(dfSim)
Use a nested list comprehension on the list of sets s_list. Within the list comprehension, use the intersection operation to check for overlap and take the length of each result. Finally, construct the dataframe and divide it by the length of each list in df.list_of_value:
s_list = df.list_of_value.map(set)
overlap = [[len(s1 & s) for s1 in s_list] for s in s_list]
df_final = pd.DataFrame(overlap) / df.list_of_value.str.len().to_numpy()[:,None]
Out[76]:
0 1 2 3
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000
In case there are duplicate values in each list, you should use collections.Counter instead of set. I changed the sample data for id=0 to ['a','a','c'] and id=1 to ['d','b','a'].
Sample df:
id list_of_value
0 ['a','a','c'] #changed
1 ['d','b','a'] #changed
2 ['a','b','c']
3 ['a','b','c']
from collections import Counter
c_list = df.list_of_value.map(Counter)
c_overlap = [[sum((c1 & c).values()) for c1 in c_list] for c in c_list]
df_final = pd.DataFrame(c_overlap) / df.list_of_value.str.len().to_numpy()[:,None]
Out[208]:
0 1 2 3
0 1.000000 0.333333 0.666667 0.666667
1 0.333333 1.000000 0.666667 0.666667
2 0.666667 0.666667 1.000000 1.000000
3 0.666667 0.666667 1.000000 1.000000
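The multiset intersection used above is `Counter & Counter`, which keeps the element-wise minimum of the counts; a quick illustration with the changed rows:

```python
from collections import Counter

c0 = Counter(['a', 'a', 'c'])   # id 0 after the change
c1 = Counter(['d', 'b', 'a'])   # id 1 after the change

# '&' keeps min(count in c0, count in c1) for each shared element.
overlap = c0 & c1
print(overlap, sum(overlap.values()))  # Counter({'a': 1}) 1  -> score 1/3 for the (0, 1) cell
```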
Updated
Since there are a lot of candidate solutions proposed, it seems like a good idea to do a timing analysis. I generated some random data with 12k rows as requested by the OP, keeping with the 3 elements per set but expanding the size of the alphabet available to populate the sets. This can be adjusted to match the actual data.
Let me know if you have a solution that you would like tested or updated.
Setup
import pandas as pd
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def random_letters(n, n_letters=52):
    return random.sample(ALPHABET[:n_letters], n)

# Create 12k rows to test scaling.
df = pd.DataFrame([{'id': i, 'list_of_value': random_letters(3)} for i in range(12000)])
Current Winner
def method_quang(df):
    s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
    return s.dot(s.T).div(s.sum(1))
%time method_quang(df)
# CPU times: user 10.5 s, sys: 828 ms, total: 11.3 s
# Wall time: 11.3 s
# ...
# [12000 rows x 12000 columns]
Contenders
def method_mcskinner(df):
    explode_df = df.set_index('id').list_of_value.explode().reset_index()
    explode_df = explode_df.rename(columns={'list_of_value': 'value'})
    denom_df = explode_df.groupby('id').size().reset_index(name='denom')
    numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y'])
    numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')
    calc_df = numer_df.merge(denom_df, on='id')
    calc_df['score'] = calc_df['numer'] / calc_df['denom']
    return calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)
%time method_mcskinner(df)
# CPU times: user 29.2 s, sys: 9.66 s, total: 38.9 s
# Wall time: 29.6 s
# ...
# [12000 rows x 12000 columns]
def method_rishab(df):
    vals = [[len(set(val1) & set(val2)) / len(val1) for val2 in df['list_of_value']] for val1 in df['list_of_value']]
    return pd.DataFrame(columns=df['id'], data=vals)
%time method_rishab(df)
# CPU times: user 2min 12s, sys: 4.64 s, total: 2min 17s
# Wall time: 2min 18s
# ...
# [12000 rows x 12000 columns]
def method_fahad(df):
    ids = list(df['id'])
    range_of_ids = range(len(ids))

    def score_calculation(s_id1, s_id2):
        s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
        s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
        # Resultant calculation: s1 & s2
        return round(len(s1 & s2) / len(s1), 2)

    dic = {indexQFID: [score_calculation(indexQFID, ind) for ind in range_of_ids] for indexQFID in range_of_ids}
    return pd.DataFrame(dic)
# Stopped manually after running for more than 10 minutes.
Original post with solution details
It is possible to do this in pandas with a self-join.
As other answers have pointed out, the first step is to unpack the data into a longer form.
explode_df = df.set_index('id').list_of_value.explode().reset_index()
explode_df = explode_df.rename(columns={'list_of_value': 'value'})
explode_df
# id value
# 0 0 a
# 1 0 b
# 2 0 c
# 3 1 d
# 4 1 b
# ...
From this table it is possible to compute the per-ID counts.
denom_df = explode_df.groupby('id').size().reset_index(name='denom')
denom_df
# id denom
# 0 0 3
# 1 1 3
# 2 2 3
# 3 3 3
And then comes the self-join, which happens on the value column. This pairs IDs once for each intersecting value, so the paired IDs can be counted to get the intersection sizes.
numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y'])
numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')
numer_df
# id id_y numer
# 0 0 0 3
# 1 0 1 2
# 2 0 2 3
# 3 0 3 3
# 4 1 0 2
# 5 1 1 3
# ...
These two can then be merged, and a score computed.
calc_df = numer_df.merge(denom_df, on='id')
calc_df['score'] = calc_df['numer'] / calc_df['denom']
calc_df
# id id_y numer denom score
# 0 0 0 3 3 1.000000
# 1 0 1 2 3 0.666667
# 2 0 2 3 3 1.000000
# 3 0 3 3 3 1.000000
# 4 1 0 2 3 0.666667
# 5 1 1 3 3 1.000000
# ...
If you prefer the matrix form, that is possible with a pivot. This will be a much larger representation if the data is sparse.
calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)
# id_y 0 1 2 3
# id
# 0 1.000000 0.666667 1.000000 1.000000
# 1 0.666667 1.000000 0.666667 0.666667
# 2 1.000000 0.666667 1.000000 1.000000
# 3 1.000000 0.666667 1.000000 1.000000
You can convert each list to a set and use the intersection function to check for overlap
(only 1 apply function is used, as you asked :-) ):
(
df.assign(s = df.list_of_value.apply(set))
.pipe(lambda x: pd.DataFrame([[len(e&f)/len(e) for f in x.s] for e in x.s]))
)
0 1 2 3
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000