
Matching two people together based on attributes

I have a dataframe with different people. Each row contains attributes which characterize the individual person. Basically I need something like a filter or matching algorithm which weights specific attributes. The dataframe looks like this:

import pandas as pd

df = pd.DataFrame({
    'sex': ['m', 'f', 'm', 'f', 'm', 'f'],
    'food': [0, 0, 1, 3, 4, 3],
    'age': ['young', 'young', 'young', 'old', 'young', 'young'],
    'kitchen': [0, 1, 2, 0, 1, 2],
})

The dataframe df looks like this:

    sex food  age     kitchen
0   m    0    young    0
1   f    0    young    1
2   m    1    young    2
3   f    3    old      0
4   m    4    young    1
5   f    3    young    2

I am looking for an algorithm which groups all people in the dataframe into pairs. My plan is to find pairs of two people based on the following attributes:

  1. One person must have a kitchen (kitchen=1)
    It is important that at least one person has a kitchen.

    kitchen=0 --> person has no kitchen

    kitchen=1 --> person has a kitchen

    kitchen=2 --> person has a kitchen but only in emergency (when there is no other option)

  2. Same food preferences

    food=0 --> meat eater

    food=1 --> does not matter

    food=2 --> vegan

    food=3 --> vegetarian

    A meat eater (food=0) can be matched with a person who doesn't care about food preferences (food=1) but can't be matched with a vegan or vegetarian. A vegan (food=2) fits best with a vegetarian (food=3) and, if necessary, can go with food=1. And so on...

  3. Similar age

    There are nine age groups: 10-18; 18-22; 22-26; 26-29; 29-34; 34-40; 40-45; 45-55 and 55-75. People in the same age group match perfectly. Neighbouring age groups match a little less well, and the youngest groups do not match well with the oldest ones. There is no clearly defined condition. The meaning of "old" and "young" is relative.

The sex doesn't matter. There are many pair combinations possible. Because my actual dataframe is very long (3000 rows), I need to find an automated solution. A solution that gives me the best pairs in a dataframe or dictionary or something else.

I really do not know how to approach this problem. I looked for similar problems on Stack Overflow, but what I found was mostly too theoretical, and nothing really fit my problem.

My expected output here would be, for example a dictionary (not sure how) or a dataframe which is sorted in a way that every two rows can be seen as one pair.

Background: The goal is to make pairs for some free time activities. I think people in the same or similar age groups share the same interests, so I want to take this into account in my code.

PParker asked Jan 01 '19 14:01


3 Answers

I have added a 'name' column as a key to identify each person.

Approach

The approach is to score each attribute and then use those scores to filter and rank the candidate pairs according to the given conditions.

Scoring for Kitchen

For kitchen scores we used:

  • Person has no kitchen : 0
  • Person has a kitchen : 1
  • Person has kitchen but only in emergency : 0.5

if Condition Logic for kitchen

We check whether [kitchen score of record 1] + [kitchen score of record 2] is greater than zero. The possible cases are:

  1. Both members have no kitchen (sum is 0) [EXCLUDED by the > 0 condition]
  2. Both members have a kitchen (sum is 2)
  3. One member has a kitchen and the other has none (sum is 1)
  4. Both members have an emergency kitchen (sum is 1)
  5. One member has an emergency kitchen and the other has a kitchen (sum is 1.5)
  6. One member has an emergency kitchen and the other has none (sum is 0.5)
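
The six kitchen cases can be checked with a few lines of Python. This is only a minimal sketch: the names `K_SCORE` and `kitchen_ok` are illustrative and do not appear in the solution code.

```python
from itertools import combinations_with_replacement

# Kitchen scores as listed above (kitchen value -> score)
K_SCORE = {0: 0.0, 1: 1.0, 2: 0.5}

def kitchen_ok(k1, k2):
    """Return True if at least one of the two people has some kind of kitchen."""
    return K_SCORE[k1] + K_SCORE[k2] > 0

# Enumerate every unordered combination of kitchen values
for k1, k2 in combinations_with_replacement(K_SCORE, 2):
    total = K_SCORE[k1] + K_SCORE[k2]
    status = "included" if kitchen_ok(k1, k2) else "excluded"
    print(k1, k2, total, status)
```

Only the combination (0, 0) is reported as excluded; every other combination has a sum above zero.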

Scoring for Food

For food scores we used:

  • food = 0 --> meat eater : -1
  • food = 1 --> does not matter : 0
  • food = 2 --> vegan : 1
  • food = 3 --> vegetarian : 1

if Condition Logic for Food

We check whether [food score of record 1] * [food score of record 2] is greater than or equal to zero. The possible cases are:

  1. Both members are meat eaters : -1 x -1 = 1 [INCLUDED]
  2. One member is a meat eater and the other is vegan or vegetarian : -1 x 1 = -1 [EXCLUDED]
  3. One member is a meat eater and the other does not care : -1 x 0 = 0 [INCLUDED]
  4. One member is vegan or vegetarian and the other does not care : 1 x 0 = 0 [INCLUDED]
  5. Both members are either vegan or vegetarian : 1 x 1 = 1 [INCLUDED]
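
As a quick illustration of the food rule, the product test can be written as a small helper. The names `F_SCORE` and `food_ok` are assumptions made for this sketch only.

```python
# Food scores as listed above (food value -> score)
F_SCORE = {0: -1, 1: 0, 2: 1, 3: 1}

def food_ok(f1, f2):
    """Return True if the product of the two food scores is non-negative."""
    return F_SCORE[f1] * F_SCORE[f2] >= 0

print(food_ok(0, 0))  # both meat eaters -> True
print(food_ok(0, 3))  # meat eater and vegetarian -> False
print(food_ok(0, 1))  # meat eater and "does not matter" -> True
print(food_ok(2, 3))  # vegan and vegetarian -> True
```

Note that this scoring gives vegans and vegetarians the same value, so it does not distinguish a vegan/vegan pair from a vegan/vegetarian pair.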

Scoring for Age Groups

For scoring age groups, we assigned some values to the groups as:

  • 10-18 : 1
  • 18-22 : 2
  • 22-26 : 3
  • 26-29 : 4
  • 29-34 : 5
  • 34-40 : 6
  • 40-45 : 7
  • 45-55 : 8
  • 55-75 : 9

Age Score Calculation

For calculating the age score, the following formula is used: age_score = round((1 - (abs(Age Group Value of Person 1 - Age Group Value of Person 2) / 10)), 2)

The formula works as follows:

  1. First, we calculate the absolute difference between the age group values of the two persons.
  2. Then we divide it by 10 to normalize it.
  3. Finally, we subtract this value from 1 to invert the distance, so that persons in the same or nearby age groups get a higher value and persons in distant age groups get a lower value.

Example cases:

  1. 18-22 and 18-22 : round(1 - (abs(2 - 2) / 10), 2) = 1.0
  2. 45-55 and 45-55 : round(1 - (abs(8 - 8) / 10), 2) = 1.0
  3. 18-22 and 45-55 : round(1 - (abs(2 - 8) / 10), 2) = 0.4
  4. 10-18 and 55-75 : round(1 - (abs(1 - 9) / 10), 2) = 0.2
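
The age score can also be wrapped in a small function. This is a sketch under the group values listed above; `AGE_VALUE` and `age_score` are names chosen here for illustration.

```python
# Age group values as assigned above
AGE_VALUE = {'10-18': 1, '18-22': 2, '22-26': 3, '26-29': 4, '29-34': 5,
             '34-40': 6, '40-45': 7, '45-55': 8, '55-75': 9}

def age_score(group1, group2):
    """Return 1.0 for the same group and lower values for groups further apart."""
    distance = abs(AGE_VALUE[group1] - AGE_VALUE[group2])
    return round(1 - distance / 10, 2)

print(age_score('18-22', '18-22'))  # 1.0
print(age_score('18-22', '45-55'))  # 0.4
print(age_score('10-18', '55-75'))  # 0.2
```

With nine groups the largest possible distance is 8, so the score never drops below 0.2.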

Final Score Calculation

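The final score of a pair is calculated as:

Final Score = Food Score + Kitchen Score + Age Score

Here the food score is the product of the two food scores and the kitchen score is the sum of the two kitchen scores, exactly as used in the conditions above. The pairs are then sorted by final score to obtain the best pairs.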
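
To make the combination concrete, here is a minimal sketch of the per-pair calculation. The function name `pair_score` and the tuple layout `(f_scr, k_scr, a_scr)` are assumptions for this example; the sample values are taken from the intermediate table shown further down.

```python
def pair_score(p1, p2):
    """Return the final score of a pair, or None if the pair is excluded.

    Each person is a tuple (f_scr, k_scr, a_scr) of already normalized scores.
    """
    f1, k1, a1 = p1
    f2, k2, a2 = p2
    food = f1 * f2       # product of the food scores
    kitchen = k1 + k2    # sum of the kitchen scores
    if food < 0 or kitchen <= 0:
        return None      # fails the food or the kitchen condition
    age = round(1 - abs(a1 - a2) / 10, 2)
    return food + kitchen + age

sabastein = (1, 1.0, 2)
smith = (1, 1.0, 2)
mary = (-1, 1.0, 3)
allen = (-1, 1.0, 8)
emily = (1, 0.0, 7)

print(pair_score(sabastein, smith))  # 4.0
print(pair_score(mary, allen))       # 3.5
print(pair_score(mary, emily))       # None (meat eater with vegetarian)
```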

Solution Code

import pandas as pd
import numpy as np

# Creating the DataFrame, here I have added the attribute 'name' for identifying the record.
df = pd.DataFrame({
    'name' : ['jacob', 'mary', 'rick', 'emily', 'sabastein', 'anna', 
              'christina', 'allen', 'jolly', 'rock', 'smith', 'waterman', 
              'mimi', 'katie', 'john', 'rose', 'leonardo', 'cinthy', 'jim', 
              'paul'],
    'sex' : ['m', 'f', 'm', 'f', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f', 
             'f', 'm', 'f', 'm', 'f', 'm', 'm'],
    'food' : [0, 0, 1, 3, 2, 3, 1, 0, 0, 3, 3, 2, 1, 2, 1, 0, 1, 0, 3, 1],
    'age' : ['10-18', '22-26', '29-34', '40-45', '18-22', '34-40', '55-75',
             '45-55', '26-29', '26-29', '18-22', '55-75', '22-26', '45-55', 
             '10-18', '22-26', '40-45', '45-55', '10-18', '29-34'],
    'kitchen' : [0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1, 0, 1, 1, 1, 0, 2, 0, 2, 1],
})

# Adding a normalized field 'k_scr' for kitchen
df['k_scr'] = df['kitchen'].map({0: 0.0, 1: 1.0, 2: 0.5})

# Adding a normalized field 'f_scr' for food
df['f_scr'] = df['food'].map({0: -1, 1: 0, 2: 1, 3: 1})

# Adding a normalized field 'a_scr' for age
age_values = {'10-18': 1, '18-22': 2, '22-26': 3, '26-29': 4, '29-34': 5,
              '34-40': 6, '40-45': 7, '45-55': 8, '55-75': 9}
df['a_scr'] = df['age'].map(age_values)

# Printing DataFrame after adding normalized score values
print(df)

commonarr = [] # Empty array for our output
dfarr = np.array(df) # Converting DataFrame to Numpy Array
for i in range(len(dfarr) - 1): # Iterating the Array row
    for j in range(i + 1, len(dfarr)): # Iterating the Array row + 1
        # Check for Food Condition to include relevant records
        if dfarr[i][6] * dfarr[j][6] >= 0: 
            # Check for Kitchen Condition to include relevant records
            if dfarr[i][5] + dfarr[j][5] > 0:
                row = []
                # Appending the names
                row.append(dfarr[i][0])
                row.append(dfarr[j][0])
                # Appending the final score
                row.append((dfarr[i][6] * dfarr[j][6]) +
                           (dfarr[i][5] + dfarr[j][5]) +
                           (round((1 - (abs(dfarr[i][7] -
                                            dfarr[j][7]) / 10)), 2)))

                # Appending the row to the Final Array
                commonarr.append(row)

# Converting Array to DataFrame
ndf = pd.DataFrame(commonarr)

# Sorting the DataFrame on Final Score
ndf = ndf.sort_values(by=[2], ascending=False)
print(ndf)

Input / Intermediate DataFrame with Scores

         name sex  food    age  kitchen  k_scr  f_scr a_scr
0       jacob   m     0  10-18        0    0.0     -1     1
1        mary   f     0  22-26        1    1.0     -1     3
2        rick   m     1  29-34        2    0.5      0     5
3       emily   f     3  40-45        0    0.0      1     7
4   sabastein   m     2  18-22        1    1.0      1     2
5        anna   f     3  34-40        2    0.5      1     6
6   christina   f     1  55-75        2    0.5      0     9
7       allen   m     0  45-55        1    1.0     -1     8
8       jolly   f     0  26-29        0    0.0     -1     4
9        rock   m     3  26-29        0    0.0      1     4
10      smith   m     3  18-22        1    1.0      1     2
11   waterman   m     2  55-75        0    0.0      1     9
12       mimi   f     1  22-26        1    1.0      0     3
13      katie   f     2  45-55        1    1.0      1     8
14       john   m     1  10-18        1    1.0      0     1
15       rose   f     0  22-26        0    0.0     -1     3
16   leonardo   m     1  40-45        2    0.5      0     7
17     cinthy   f     0  45-55        0    0.0     -1     8
18        jim   m     3  10-18        2    0.5      1     1
19       paul   m     1  29-34        1    1.0      0     5

Output

             0          1    2
48   sabastein      smith  4.0
10        mary      allen  3.5
51   sabastein      katie  3.4
102      smith        jim  3.4
54   sabastein        jim  3.4
99       smith      katie  3.4
61        anna      katie  3.3
45   sabastein       anna  3.1
58        anna      smith  3.1
14        mary       rose  3.0
12        mary       mimi  3.0
84       allen     cinthy  3.0
98       smith       mimi  2.9
105   waterman      katie  2.9
11        mary      jolly  2.9
50   sabastein       mimi  2.9
40       emily      katie  2.9
52   sabastein       john  2.9
100      smith       john  2.9
90        rock      smith  2.8
47   sabastein       rock  2.8
0        jacob       mary  2.8
17        mary       paul  2.8
13        mary       john  2.8
119      katie        jim  2.8
116       mimi       paul  2.8
111       mimi       john  2.8
103      smith       paul  2.7
85       allen       paul  2.7
120      katie       paul  2.7
..         ...        ...  ...

This solution has further scope of optimization.
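
One possible next step, since the sorted list still contains the same person in many rows, is to walk through it from the best score downwards and keep a pair only if neither person has been used yet. The sketch below is a simple greedy selection and not a guaranteed optimum; the function name `greedy_pairs` is hypothetical.

```python
def greedy_pairs(scored_pairs):
    """Pick non-overlapping pairs, best score first.

    scored_pairs: iterable of (name_1, name_2, score) tuples.
    """
    used = set()
    result = []
    for a, b, score in sorted(scored_pairs, key=lambda r: r[2], reverse=True):
        if a in used or b in used:
            continue  # one of the two people is already paired
        used.update((a, b))
        result.append((a, b, score))
    return result

# First rows of the output above
rows = [
    ('sabastein', 'smith', 4.0),
    ('mary', 'allen', 3.5),
    ('sabastein', 'katie', 3.4),
    ('smith', 'jim', 3.4),
    ('sabastein', 'jim', 3.4),
    ('smith', 'katie', 3.4),
    ('anna', 'katie', 3.3),
]
for pair in greedy_pairs(rows):
    print(pair)
```

With the full result, the rows could be passed in as `greedy_pairs(ndf.itertuples(index=False, name=None))`. A greedy pass can leave some people unpaired who would have found a partner under a different choice, so a maximum weight matching algorithm may be worth considering if that matters.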

Anidhya Bhatnagar answered Oct 24 '22 18:10


This seems like a very interesting problem to me. There are several ways to solve it. I will describe one of them.

A possible approach is to create an additional column in your dataframe containing a 'code' which encodes the given attributes. For example:

    sex  food  age      kitchen   code
0   m    0     young    0         0y0
1   f    0     young    1         0y1
2   m    1     young    2         1y2
3   f    3     old      0         3o0
4   m    4     young    1         4y1
5   f    3     young    2         3y2

This 'code' is made up of abbreviations of your attributes. Since the sex doesn't matter, the first character in the code stands for the 'food', the second one for the 'age' and the third for the 'kitchen'.

4y1 = food 4, age young, kitchen 1.
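
A code like this could be built as follows. This is a minimal sketch using plain tuples for the example rows; `make_code` is a hypothetical helper name, and taking the first letter of the age only works for labels such as 'young' and 'old'.

```python
def make_code(food, age, kitchen):
    """Build a code such as '4y1' from food, the first letter of age, and kitchen."""
    return f"{food}{age[0]}{kitchen}"

# Rows of the example dataframe as (food, age, kitchen)
people = [(0, 'young', 0), (0, 'young', 1), (1, 'young', 2),
          (3, 'old', 0), (4, 'young', 1), (3, 'young', 2)]

codes = [make_code(*p) for p in people]
print(codes)  # ['0y0', '0y1', '1y2', '3o0', '4y1', '3y2']
```

On the dataframe itself, the same helper could fill the new column with `df['code'] = [make_code(f, a, k) for f, a, k in zip(df['food'], df['age'], df['kitchen'])]`.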

Based on these codes you can come up with a pattern. I recommend working with regular expressions for this. You can then write something like this:

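import re

# Patterns: any food, any age, then the kitchen digit
haskitchen = re.compile(r'\S\S1')
hasnokitchen = re.compile(r'\S\S0')

# Assumes df already has the 'code' column described above
codes = df['code'].tolist()

match_kitchen = [c for c in codes if haskitchen.fullmatch(c)]
match_nokitchen = [c for c in codes if hasnokitchen.fullmatch(c)]

kitchendict = {}
kitchendict["Has kitchen"] = match_kitchen
kitchendict["Has no kitchen"] = match_nokitchen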

Based on this, you can loop over the entries and put them together however you want. There may be a much easier solution and I haven't tested the code, but this is what came to mind. One thing is for sure: use regular expressions for matching.

Mowgli answered Oct 24 '22 17:10


Well, let's test for the kitchen.

# Assumes df is the dataframe from the question
for I in df['kitchen']:
    if I != 0:
        print("Kitchen found")
    else:
        print("No kitchen")

Okay, now that we know who has a kitchen, let's find, for each person without a kitchen, someone who has a kitchen and similar food preferences. Let's create a variable x that tells us how many people have a kitchen. Let's also make the people variable for counting people.

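people = 0  # number of people checked so far
x = 0       # number of people who have a kitchen
kitchen = list(df['kitchen'])  # df is the dataframe from the question
food = list(df['food'])
for p, (I, A) in enumerate(zip(kitchen, food)):
    people = people + 1
    if I != 0:
        x = x + 1
        print("Kitchen found")
    else:
        print("No kitchen")
        # Look for a kitchen owner with compatible food preferences
        for q, (K, J) in enumerate(zip(kitchen, food)):
            if K == 0:
                continue  # person q has no kitchen either
            # Same preference, or person p does not care (food=1)
            if A == J or A == 1:
                print("food match found for person " + str(p) + ": person " + str(q))
            elif A == 0:
                if J == 1:
                    print("food match found for person " + str(p) + ": person " + str(q))
            elif A == 2 or A == 3:
                if J == 2 or J == 3 or J == 1:
                    print("food match found for person " + str(p) + ": person " + str(q))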

I am currently working on the age part and adjusting a few things.

Dodge answered Oct 24 '22 17:10