
Create an N x N matrix from one column in pandas

I have a dataframe in which each row holds a list of values.

id     list_of_value
0      ['a','b','c']
1      ['d','b','c']
2      ['a','b','c']
3      ['a','b','c']

I need to calculate a score for each row against all the other rows.

For example:

Step 1: Take value of id 0: ['a','b','c'],
Step 2: find the intersection between id 0 and id 1 , 
        resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id.size

Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and similarly for all the other ids.

Then create an N x N dataframe, such as this:

-  0  1    2  3
0  1  0.6  1  1
1  1  1    1  1 
2  1  1    1  1
3  1  1    1  1
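
For a single pair of rows, steps 1 to 3 can be sketched with plain Python sets. The helper name `score` is made up for illustration, and the sketch assumes the denominator is the size of the first list.

```python
def score(a, b):
    """Fraction of the items of `a` that also occur in `b` (hypothetical helper)."""
    common = set(a) & set(b)       # step 2: intersection of the two lists
    return len(common) / len(a)    # step 3: intersection size / size of `a`

row0 = ['a', 'b', 'c']
row1 = ['d', 'b', 'c']
print(score(row0, row1))  # 2 shared items out of 3
```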

Right now my code has just one for loop:

import numpy as np
import pandas as pd

def scoreCalc(x, queryTData):
    # Score = size of the intersection / size of the query list
    commonTData = np.intersect1d(np.array(x), queryTData)
    return commonTData.size / queryTData.size

ids = list(df['id'])
dfSim = pd.DataFrame()

for indexQFID in range(len(ids)):
    # Values belonging to the current query id
    queryTData = np.array(df.loc[df['id'] == ids[indexQFID]]['list_of_value'].values.tolist())

    # One column per query id, scored against every row
    dfSim[ids[indexQFID]] = df['list_of_value'].apply(scoreCalc, args=(queryTData,))

Is there a better way to do this? Can I write a single apply function instead of iterating with a for loop? Can I make it faster?

Sriram Arvind Lakshmanakumar asked Apr 05 '20 12:04


5 Answers

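If your data is not too big, you can use get_dummies to one-hot encode the values and then do a matrix multiplication:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
s.dot(s.T).div(s.sum(1))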

Output:

          0         1         2         3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000

Update: Here's a short explanation of the code. The main idea is to turn the given lists into a one-hot encoding:

   a  b  c  d
0  1  1  1  0
1  0  1  1  1
2  1  1  1  0
3  1  1  1  0

Once we have that, the size of the intersection of two rows, say 0 and 1, is just their dot product, because a character belongs to both rows if and only if it is represented by 1 in both.
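
As a small illustration of that claim, here is a sketch using the one-hot rows for ids 0 and 1 from the table above (the variable names are only for this example):

```python
import numpy as np

# One-hot rows over the alphabet a, b, c, d, copied from the table above.
row0 = np.array([1, 1, 1, 0])  # ['a', 'b', 'c']
row1 = np.array([0, 1, 1, 1])  # ['d', 'b', 'c']

# A product is 1 only where both rows have a 1, so the sum counts shared items.
intersection_size = int(row0 @ row1)
print(intersection_size)  # 2, i.e. 'b' and 'c'
```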

With that in mind, first use

df.list_of_value.explode()

to turn each cell into a series and concatenate all of those series. Output:

0    a
0    b
0    c
1    d
1    b
1    c
2    a
2    b
2    c
3    a
3    b
3    c
Name: list_of_value, dtype: object

Now, we use pd.get_dummies on that series to turn it into a one-hot-encoded dataframe:

   a  b  c  d
0  1  0  0  0
0  0  1  0  0
0  0  0  1  0
1  0  0  0  1
1  0  1  0  0
1  0  0  1  0
2  1  0  0  0
2  0  1  0  0
2  0  0  1  0
3  1  0  0  0
3  0  1  0  0
3  0  0  1  0

As you can see, each value has its own row. Since we want to combine the rows that belong to the same original row into one, we can just sum them by the original index. Thus

s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()

gives the binary-encoded dataframe we want. The next line

s.dot(s.T).div(s.sum(1))

follows your logic: s.dot(s.T) computes the dot products between rows, then .div(s.sum(1)) divides those counts by the list sizes.
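
Putting the pieces together, a minimal end-to-end sketch on the four sample rows might look like the following. It uses groupby(level=0).sum() because sum(level=0) is no longer available in pandas 2.0 and later. The names onehot, overlap, sizes, by_row and by_col are made up for this example. Note that div aligns a Series with the column labels unless axis=0 is passed, so the two variants below differ only when the lists have different lengths.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'list_of_value': [['a', 'b', 'c'], ['d', 'b', 'c'],
                      ['a', 'b', 'c'], ['a', 'b', 'c']],
})

# One-hot encode every exploded value, then collapse back to one row per original index.
onehot = pd.get_dummies(df['list_of_value'].explode()).astype(int)
s = onehot.groupby(level=0).sum()

overlap = s.dot(s.T)                   # pairwise intersection sizes
sizes = s.sum(axis=1)                  # number of values per row
by_row = overlap.div(sizes, axis=0)    # entry (i, j) = |i & j| / |i|
by_col = overlap.div(sizes, axis=1)    # entry (i, j) = |i & j| / |j|
print(by_row.round(2))
```

Which normalisation is wanted depends on whether the score should be relative to the row list or the column list.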

Quang Hoang answered Oct 22 '22 02:10


Try this

ids = list(df['id'])
range_of_ids = range(len(ids))

def score_calculation(s_id1,s_id2):
    s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
    s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
    # Resultant calculation s1&s2
    return round(len(s1&s2)/len(s1) , 2)


dic = {indexQFID:  [score_calculation(indexQFID,ind) for ind in range_of_ids] for indexQFID in range_of_ids}
dfSim = pd.DataFrame(dic)
print(dfSim)

Output

     0        1      2       3
0   1.00    0.67    1.00    1.00
1   0.67    1.00    0.67    0.67
2   1.00    0.67    1.00    1.00
3   1.00    0.67    1.00    1.00

You can also do it as follows:

dic = {indexQFID:  [round(len(set(s1)&set(s2))/len(s1) , 2) for s2 in df['list_of_value']] for indexQFID,s1 in zip(df['id'],df['list_of_value']) }
dfSim = pd.DataFrame(dic)
print(dfSim)

FAHAD SIDDIQUI answered Oct 22 '22 01:10


Use a nested list comprehension over the list of sets s_list. Inside the comprehension, use the intersection operator to find the overlap and take the length of each result. Finally, construct the dataframe and divide it by the length of each list in df.list_of_value:

s_list =  df.list_of_value.map(set)
overlap = [[len(s1 & s) for s1 in s_list] for s in s_list]

df_final = pd.DataFrame(overlap) / df.list_of_value.str.len().to_numpy()[:,None]

Out[76]:
          0         1         2         3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000

If a list can contain duplicate values, use collections.Counter instead of set. To illustrate, I changed the sample data for id=0 to ['a','a','c'] and for id=1 to ['d','b','a'].

sample df:
id     list_of_value
0      ['a','a','c'] #changed
1      ['d','b','a'] #changed
2      ['a','b','c']
3      ['a','b','c']

from collections import Counter

c_list =  df.list_of_value.map(Counter)
c_overlap = [[sum((c1 & c).values()) for c1 in c_list] for c in c_list]

df_final = pd.DataFrame(c_overlap) / df.list_of_value.str.len().to_numpy()[:,None]


Out[208]:
          0         1         2         3
0  1.000000  0.333333  0.666667  0.666667
1  0.333333  1.000000  0.666667  0.666667
2  0.666667  0.666667  1.000000  1.000000
3  0.666667  0.666667  1.000000  1.000000
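
To see why Counter matters when a list has duplicates, here is a small sketch comparing the two approaches on the modified id 0 scored against itself (the variable names are only for this example). The & operator on Counter objects keeps the minimum count of each element.

```python
from collections import Counter

values = ['a', 'a', 'c']   # id 0 in the modified sample

# With sets the duplicate 'a' collapses, so the row no longer fully matches itself.
set_self = len(set(values) & set(values)) / len(values)

# With Counter, & keeps the minimum count per element, so duplicates are preserved.
counter_self = sum((Counter(values) & Counter(values)).values()) / len(values)

print(set_self, counter_self)
```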

Andy L. answered Oct 22 '22 03:10


Updated

Since there are a lot of candidate solutions proposed, it seems like a good idea to do a timing analysis. I generated some random data with 12k rows as requested by the OP, keeping with the 3 elements per set but expanding the size of the alphabet available to populate the sets. This can be adjusted to match the actual data.

Let me know if you have a solution that you would like tested or updated.

Setup

import pandas as pd
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def random_letters(n, n_letters=52):
    return random.sample(ALPHABET[:n_letters], n)

# Create 12k rows to test scaling.
df = pd.DataFrame([{'id': i, 'list_of_value': random_letters(3)} for i in range(12000)])

Current Winner

def method_quang(df):
    # groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
    s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
    return s.dot(s.T).div(s.sum(1))

%time method_quang(df)
# CPU times: user 10.5 s, sys: 828 ms, total: 11.3 s
# Wall time: 11.3 s
# ...
# [12000 rows x 12000 columns]

Contenders

def method_mcskinner(df):
    explode_df = df.set_index('id').list_of_value.explode().reset_index() 
    explode_df = explode_df.rename(columns={'list_of_value': 'value'}) 
    denom_df = explode_df.groupby('id').size().reset_index(name='denom') 
    numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y']) 
    numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer') 
    calc_df = numer_df.merge(denom_df, on='id') 
    calc_df['score'] = calc_df['numer'] / calc_df['denom'] 
    # pivot takes keyword-only arguments in pandas 2.0 and later
    return calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)

%time method_mcskinner(df)
# CPU times: user 29.2 s, sys: 9.66 s, total: 38.9 s
# Wall time: 29.6 s
# ...
# [12000 rows x 12000 columns]

def method_rishab(df):
    vals = [[len(set(val1) & set(val2)) / len(val1) for val2 in df['list_of_value']] for val1 in df['list_of_value']]
    return pd.DataFrame(columns=df['id'], data=vals)

%time method_rishab(df)
# CPU times: user 2min 12s, sys: 4.64 s, total: 2min 17s
# Wall time: 2min 18s
# ...
# [12000 rows x 12000 columns]

def method_fahad(df):
    ids = list(df['id']) 
    range_of_ids = range(len(ids)) 

    def score_calculation(s_id1,s_id2): 
        s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0]) 
        s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0]) 
        # Resultant calculation s1&s2 
        return round(len(s1&s2)/len(s1) , 2) 

    dic = {indexQFID:  [score_calculation(indexQFID,ind) for ind in range_of_ids] for indexQFID in range_of_ids} 
    return pd.DataFrame(dic) 

# Stopped manually after running for more than 10 minutes.
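
The %time magic only works in IPython or Jupyter. In a plain Python script, a rough equivalent could be sketched as below. The helper timed and the function method_sets are hypothetical names for this example, and the data is kept to 200 rows so the check runs quickly.

```python
import random
import time

import pandas as pd

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def timed(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def method_sets(df):
    # Same idea as the list-comprehension answers, with each set built only once.
    sets = [set(v) for v in df['list_of_value']]
    return pd.DataFrame([[len(a & b) / len(a) for b in sets] for a in sets])

random.seed(0)  # fixed seed so repeated runs use the same data
small_df = pd.DataFrame(
    [{'id': i, 'list_of_value': random.sample(ALPHABET, 3)} for i in range(200)]
)

result, seconds = timed(method_sets, small_df)
print(result.shape, f'{seconds:.3f}s')
```

A single run is noisy, so for real comparisons it is worth repeating the measurement a few times.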

Original post with solution details

It is possible to do this in pandas with a self-join.

As other answers have pointed out, the first step is to unpack the data into a longer form.

explode_df = df.set_index('id').list_of_value.explode().reset_index()
explode_df = explode_df.rename(columns={'list_of_value': 'value'})
explode_df
#     id value
# 0    0     a
# 1    0     b
# 2    0     c
# 3    1     d
# 4    1     b
# ...

From this table it is possible to compute the per-ID counts.

denom_df = explode_df.groupby('id').size().reset_index(name='denom')
denom_df
#    id  denom
# 0   0      3
# 1   1      3
# 2   2      3
# 3   3      3

And then comes the self-join, which happens on the value column. This pairs IDs once for each intersecting value, so the paired IDs can be counted to get the intersection sizes.

numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y'])
numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')
numer_df
#     id  id_y  numer
# 0    0     0      3
# 1    0     1      2
# 2    0     2      3
# 3    0     3      3
# 4    1     0      2
# 5    1     1      3
# ...

These two can then be merged, and a score computed.

calc_df = numer_df.merge(denom_df, on='id')
calc_df['score'] = calc_df['numer'] / calc_df['denom']
calc_df
#     id  id_y  numer  denom     score
# 0    0     0      3      3  1.000000
# 1    0     1      2      3  0.666667
# 2    0     2      3      3  1.000000
# 3    0     3      3      3  1.000000
# 4    1     0      2      3  0.666667
# 5    1     1      3      3  1.000000
# ...

If you prefer the matrix form, that is possible with a pivot. This will be a much larger representation if the data is sparse.

calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)
# id_y         0         1         2         3
# id                                          
# 0     1.000000  0.666667  1.000000  1.000000
# 1     0.666667  1.000000  0.666667  0.666667
# 2     1.000000  0.666667  1.000000  1.000000
# 3     1.000000  0.666667  1.000000  1.000000
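
One way to sanity-check the self-join is to compare its result against a brute-force computation on the four sample rows. The sketch below repeats the steps above with keyword arguments for pivot. The names matrix and expected are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'list_of_value': [['a', 'b', 'c'], ['d', 'b', 'c'],
                      ['a', 'b', 'c'], ['a', 'b', 'c']],
})

# Long form: one row per (id, value) pair.
explode_df = df.set_index('id').list_of_value.explode().reset_index()
explode_df = explode_df.rename(columns={'list_of_value': 'value'})

# Denominator: list size per id. Numerator: shared values per pair of ids.
denom_df = explode_df.groupby('id').size().reset_index(name='denom')
numer_df = explode_df.merge(explode_df, on='value', suffixes=('', '_y'))
numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')

calc_df = numer_df.merge(denom_df, on='id')
calc_df['score'] = calc_df['numer'] / calc_df['denom']
matrix = calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)

# Brute-force reference built directly from Python sets.
sets = df['list_of_value'].map(set)
expected = [[len(a & b) / len(a) for b in sets] for a in sets]
print(np.allclose(matrix.to_numpy(), expected))
```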

mcskinner answered Oct 22 '22 02:10


You can convert each list to a set and use set intersection to check for overlap:

(only 1 apply function is used as you asked :-) )

(
    df.assign(s = df.list_of_value.apply(set))
    .pipe(lambda x: pd.DataFrame([[len(e&f)/len(e) for f in x.s] for e in x.s]))
)

    0           1           2           3
0   1.000000    0.666667    1.000000    1.000000
1   0.666667    1.000000    0.666667    0.666667
2   1.000000    0.666667    1.000000    1.000000
3   1.000000    0.666667    1.000000    1.000000

Allen answered Oct 22 '22 02:10