how to get dataframe subsets having similar values on some columns?

Tags:

I'm working on the following dataframe:

   height  weight  shoe_size  age
0     175      70         40   30
1     175      75         39   41
2     175      69         40   33
3     176      71         40   35
4     178      81         41   27
5     169      73         38   49
6     170      65         39   30

and discovered this Relaxed Functional Dependency (RFD):

('weight': 2.0) ==> ('height': 1.0)

meaning that for each couple of rows having a difference <=2 on the weight they would have a difference <=1 on the height too.

I need to find all the subsets of rows on which this RFD holds, and show the one with more rows.

In this case the best (biggest) subset would be:

   height  weight  shoe_size  age
2     175      69         40   33
0     175      70         40   30
3     176      71         40   35

How do I get all such subsets from a dataframe or at least the biggest subset for which this RFD holds?

Update

I implemented the solution suggested by @DYZ, but it seems like the threshold is respected only for single edges and not for the full path from one node to the other of the connected component in the graph.

To better explain what I'm talking about, here an example of the subset found with the following RFD

('height': 1.0, 'age': 6.0) ==> ('weight': 4.0)

Subset

   height  weight  shoe_size  age
0     175      70         40   30
1     175      75         39   41
2     175      69         40   33
3     176      71         40   35

This subset is wrong because the rows

   height  weight  shoe_size  age
0     175      70         40   30

and

   height  weight  shoe_size  age
1     175      75         39   41

have a distance of 11 > 6, on the age, as well as a distance of 5 > 4 on the weight.

I think this may be because it's considered the adjacency matrix instead of a distance matrix for the graph, so a row is added if there is at least one edge respecting the threshold, but it should actually be there if all the paths in the connected components respect the thresholds.

770

asked Feb 16 '19 23:02

1Z10

1 Answers

There may be a better solution to your problem. I am going to show how to find the answer using graph theory (namely, module networkx).

import networkx as nx
import numpy as np

Start by isolating the columns that are involved in the RFD calculation:

values = df[['height', 'weight']].values

Calculate all possible differences. This is an O(N^2) operation and may be costly in terms of time and memory.

dist = np.abs(values[:,None] - values)

Identify the suitable pairs of rows:

im = (dist[:,:,0] <= 1) & (dist[:,:,1] <= 2)

Use them as an adjacency matrix and construct a graph. The graph nodes represent rows in the original dataframe. The nodes are connected if the corresponding rows are in the RFD:

G = nx.from_numpy_matrix(im)

Find all cliques in the graph (subgraphs where any node is connected directly to any other node). Sort the cliques by size and choose the largest. Extract the namesake rows from the original dataframe:

df.loc[sorted(nx.clique.find_cliques(G), key=len)[-1],:]
#   height  weight  shoe_size  age
#0     175      70         40   30
#2     175      69         40   33
#3     176      71         40   35

168

answered Nov 15 '22 06:11

DYZ

Related questions
                            
                                Is it possible to use a component inside another component with some modification
                            
                                "Bullets" shot from two Pygame sprites combine into a single long line
                            
                                How do I view syntax-highlighted source code for operators e.g. %in%?
                            
                                Groupby and find difference from min value of group : Pandas
                            
                                Select rows from a dataframe based on a pattern formed by consecutive values in a column
                            
                                Print the first letter of each word in a sentence
                            
                                How to simulate python 2 str.lower() in python 3
                            
                                D3.js Synchronized Zoom Across Multiple Graphs
                            
                                Playing MediaStream using AudioContext.createMediaStreamSource vs HTMLAudioElement.srcObject
                            
                                Vuetify.js - IE11 not showing datatables with babel-polyfill
                            
                                What is the best practice to deal with RxSwift retry and error handling
                            
                                Filtering and pagination is not working with ngxpagination template

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to get dataframe subsets having similar values on some columns?

Tags:

1Z10

People also ask

1 Answers

DYZ

Recent Activity

Donate For Us