Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to get dataframe subsets having similar values on some columns?

Tags:

I'm working on the following dataframe:

   height  weight  shoe_size  age
0     175      70         40   30
1     175      75         39   41
2     175      69         40   33
3     176      71         40   35
4     178      81         41   27
5     169      73         38   49
6     170      65         39   30

and discovered this Relaxed Functional Dependency (RFD):

('weight': 2.0) ==> ('height': 1.0)

meaning that for each couple of rows having a difference <=2 on the weight they would have a difference <=1 on the height too.

I need to find all the subsets of rows on which this RFD holds, and show the one with more rows.

In this case the best (biggest) subset would be:

   height  weight  shoe_size  age
2     175      69         40   33
0     175      70         40   30
3     176      71         40   35

How do I get all such subsets from a dataframe or at least the biggest subset for which this RFD holds?

Update

I implemented the solution suggested by @DYZ, but it seems like the threshold is respected only for single edges and not for the full path from one node to the other of the connected component in the graph.

To better explain what I'm talking about, here an example of the subset found with the following RFD

('height': 1.0, 'age': 6.0) ==> ('weight': 4.0)

Subset

   height  weight  shoe_size  age
0     175      70         40   30
1     175      75         39   41
2     175      69         40   33
3     176      71         40   35

This subset is wrong because the rows

   height  weight  shoe_size  age
0     175      70         40   30

and

   height  weight  shoe_size  age
1     175      75         39   41

have a distance of 11 > 6, on the age, as well as a distance of 5 > 4 on the weight.

I think this may be because it's considered the adjacency matrix instead of a distance matrix for the graph, so a row is added if there is at least one edge respecting the threshold, but it should actually be there if all the paths in the connected components respect the thresholds.

like image 770
1Z10 Avatar asked Feb 16 '19 23:02

1Z10


People also ask

How do I compare two column values in a data frame?

By using the Where() method in NumPy, we are given the condition to compare the columns. If 'column1' is lesser than 'column2' and 'column1' is lesser than the 'column3', We print the values of 'column1'. If the condition fails, we give the value as 'NaN'. These results are stored in the new column in the dataframe.


1 Answers

There may be a better solution to your problem. I am going to show how to find the answer using graph theory (namely, module networkx).

import networkx as nx
import numpy as np

Start by isolating the columns that are involved in the RFD calculation:

values = df[['height', 'weight']].values

Calculate all possible differences. This is an O(N^2) operation and may be costly in terms of time and memory.

dist = np.abs(values[:,None] - values)

Identify the suitable pairs of rows:

im = (dist[:,:,0] <= 1) & (dist[:,:,1] <= 2)

Use them as an adjacency matrix and construct a graph. The graph nodes represent rows in the original dataframe. The nodes are connected if the corresponding rows are in the RFD:

G = nx.from_numpy_matrix(im)

Find all cliques in the graph (subgraphs where any node is connected directly to any other node). Sort the cliques by size and choose the largest. Extract the namesake rows from the original dataframe:

df.loc[sorted(nx.clique.find_cliques(G), key=len)[-1],:]
#   height  weight  shoe_size  age
#0     175      70         40   30
#2     175      69         40   33
#3     176      71         40   35
like image 168
DYZ Avatar answered Nov 15 '22 06:11

DYZ