I'm working on the following dataframe:
height weight shoe_size age
0 175 70 40 30
1 175 75 39 41
2 175 69 40 33
3 176 71 40 35
4 178 81 41 27
5 169 73 38 49
6 170 65 39 30
and discovered this Relaxed Functional Dependency (RFD):
('weight': 2.0) ==> ('height': 1.0)
meaning that for each couple of rows having a difference <=2 on the weight they would have a difference <=1 on the height too.
I need to find all the subsets of rows on which this RFD holds, and show the one with more rows.
In this case the best (biggest) subset would be:
height weight shoe_size age
2 175 69 40 33
0 175 70 40 30
3 176 71 40 35
How do I get all such subsets from a dataframe or at least the biggest subset for which this RFD holds?
Update
I implemented the solution suggested by @DYZ, but it seems like the threshold is respected only for single edges and not for the full path from one node to the other of the connected component in the graph.
To better explain what I'm talking about, here an example of the subset found with the following RFD
('height': 1.0, 'age': 6.0) ==> ('weight': 4.0)
Subset
height weight shoe_size age
0 175 70 40 30
1 175 75 39 41
2 175 69 40 33
3 176 71 40 35
This subset is wrong because the rows
height weight shoe_size age
0 175 70 40 30
and
height weight shoe_size age
1 175 75 39 41
have a distance of 11 > 6, on the age, as well as a distance of 5 > 4 on the weight.
I think this may be because it's considered the adjacency matrix instead of a distance matrix for the graph, so a row is added if there is at least one edge respecting the threshold, but it should actually be there if all the paths in the connected components respect the thresholds.
By using the Where() method in NumPy, we are given the condition to compare the columns. If 'column1' is lesser than 'column2' and 'column1' is lesser than the 'column3', We print the values of 'column1'. If the condition fails, we give the value as 'NaN'. These results are stored in the new column in the dataframe.
There may be a better solution to your problem. I am going to show how to find the answer using graph theory (namely, module networkx
).
import networkx as nx
import numpy as np
Start by isolating the columns that are involved in the RFD calculation:
values = df[['height', 'weight']].values
Calculate all possible differences. This is an O(N^2)
operation and may be costly in terms of time and memory.
dist = np.abs(values[:,None] - values)
Identify the suitable pairs of rows:
im = (dist[:,:,0] <= 1) & (dist[:,:,1] <= 2)
Use them as an adjacency matrix and construct a graph. The graph nodes represent rows in the original dataframe. The nodes are connected if the corresponding rows are in the RFD:
G = nx.from_numpy_matrix(im)
Find all cliques in the graph (subgraphs where any node is connected directly to any other node). Sort the cliques by size and choose the largest. Extract the namesake rows from the original dataframe:
df.loc[sorted(nx.clique.find_cliques(G), key=len)[-1],:]
# height weight shoe_size age
#0 175 70 40 30
#2 175 69 40 33
#3 176 71 40 35
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With