I am sure this topic has been researched before, but I am not sure what it is called or what techniques I should look into, hence why I am here. I am running this mainly in Python and Pandas, but it is not limited to those languages/technologies.
As an example, let's pretend I have this dataset:
| PID | A | B | C |
| --- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 |
| 400 | 0.97 | 0.61 | 0.30 |
| 251 | 0.01 | 0.97 | 0.29 |
| 414 | 0.25 | 0.04 | 0.83 |
| 706 | 0.37 | 0.32 | 0.33 |
| 65 | 0.78 | 0.62 | 0.25 |
| 533 | 0.24 | 0.15 | 0.88 |
PID is a unique ID for that row. A, B and C are some factors (normalized for this example). This dataset could be players in a sports league over history, it could be products in an inventory, it could be voter data. The specific context isn't important.
Now let's say I have some input data:
| A | B | C |
| ---- | ---- | ---- |
| 0.81 | 0.75 | 0.17 |
This input shares the same factors as the original dataset (A, B, C). What I want to do is to find the rows that are similar to my input data (the "cohorts"). What is the best way to approach this?
I thought of clustering, using a kNN algorithm, but the problem is that the number of cohorts is not fixed. You could have a unique input with few or no "cohorts", or you could have a very common input with hundreds of "cohorts".
The solution I next tried was Euclidean Distance. So for this dataset and input I would do something like:
import pandas as pd

my_cols = ['A', 'B', 'C']
inputdata = pd.Series([0.81, 0.75, 0.17], index=my_cols)
# df = pandas data frame with above data
df['Dist'] = (df[my_cols] - inputdata).pow(2).sum(axis=1).pow(0.5)
This would create a new column on the dataset like:
| PID | A | B | C | Dist |
| --- | ---- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 | 0.27 |
| 400 | 0.97 | 0.61 | 0.30 | 0.25 |
| 251 | 0.01 | 0.97 | 0.29 | 0.84 |
| 414 | 0.25 | 0.04 | 0.83 | 1.12 |
| 706 | 0.37 | 0.32 | 0.33 | 0.63 |
| 65 | 0.78 | 0.62 | 0.25 | 0.16 |
| 533 | 0.24 | 0.15 | 0.88 | 1.09 |
You can then "filter" out those rows below some threshold.
cohorts = df[df['Dist'] <= THRESHOLD]
The issue then becomes: (1) how do you determine the best threshold? And (2) if I add a fourth factor ("D") to the dataset and the Euclidean calculation, it seems to "break" the results, in that the cohorts no longer make intuitive sense.
So my question is: what are techniques or better ways to filter/select "cohorts" (those rows similar to an input row)?
Thank you
Here is an algorithm I came up with myself through logical thinking and some basic statistics. It uses the mean of each row's values and the mean of your input data to find the closest matches within one standard deviation, using pd.merge_asof:
factors = ['A', 'B', 'C']

# average the factors per row and sort, since merge_asof needs sorted keys
df = df.assign(avg=df[factors].mean(axis=1)).sort_values('avg')
input_data = input_data.assign(avg=input_data[factors].mean(axis=1)).sort_values('avg')

# match each row to the input whose average is nearest, but only if it lies
# within one standard deviation of the row averages
dfn = pd.merge_asof(
    df,
    input_data,
    on='avg',
    direction='nearest',
    tolerance=df['avg'].std()
)
PID A_x B_x C_x avg A_y B_y C_y
0 706 0.37 0.32 0.33 0.340000 NaN NaN NaN
1 414 0.25 0.04 0.83 0.373333 NaN NaN NaN
2 251 0.01 0.97 0.29 0.423333 NaN NaN NaN
3 533 0.24 0.15 0.88 0.423333 NaN NaN NaN
4 508 0.85 0.51 0.05 0.470000 NaN NaN NaN
5 65 0.78 0.62 0.25 0.550000 0.81 0.75 0.17
6 400 0.97 0.61 0.30 0.626667 0.81 0.75 0.17
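If you only want the matching rows, one small follow-up sketch (my own addition) is to drop the unmatched ones from the merged frame dfn above: rows outside the tolerance received NaN in the merged input columns (A_y, B_y, C_y), so removing those leaves the cohort.

# rows outside the tolerance got NaN in the merged input columns
cohorts = dfn.dropna(subset=['A_y'])
print(cohorts[['PID', 'A_x', 'B_x', 'C_x']])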
You're facing a clustering problem, so your K-means intuition was right.
But, as you mentioned, K-means is a parametric approach, so you need to determine the right K. There is an automated way of finding the best K with respect to cluster quality (shape, stability, homogeneity), known as the elbow method: https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
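A minimal sketch of the elbow method using the yellowbrick package from that link (the k range of 2-10 is an arbitrary choice and assumes a realistically sized dataset, not the seven-row example above):

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

X = df[['A', 'B', 'C']].values

# try K = 2..10 and let the visualizer mark the "elbow", i.e. the K where
# adding more clusters stops noticeably reducing the distortion
visualizer = KElbowVisualizer(KMeans(n_init=10), k=(2, 10))
visualizer.fit(X)
visualizer.show()

best_k = visualizer.elbow_value_   # the suggested K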
Then, you can use another clustering approach (in fact, the right clustering algorithm depends on the meaning of your features); for example, you can use a density-based approach with DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
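A minimal DBSCAN sketch. DBSCAN has no predict method, so clustering the input row together with the data and reading off which rows share its label is my own workaround, not anything official; eps and min_samples are placeholders you would tune.

import pandas as pd
from sklearn.cluster import DBSCAN

# stack the input row (the Series from the question) onto the factor columns
X = pd.concat([df[['A', 'B', 'C']], inputdata.to_frame().T], ignore_index=True)

labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)

input_label = labels[-1]                      # cluster assigned to the input row
if input_label == -1:
    cohorts = df.iloc[0:0]                    # input was flagged as noise: no cohort
else:
    cohorts = df[labels[:-1] == input_label]  # rows in the same cluster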
Thus, you'll need to identify the best clustering algorithm depending on your problem: https://machinelearningmastery.com/clustering-algorithms-with-python/
With this solution you'll fit your clustering algorithm on your training set (the one you call the "cohort" set), and then use the model to predict the cluster of your "non-cohort" samples.
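For instance with K-means (n_clusters=3 is just a placeholder; in practice you would use the K suggested by the elbow method above):

from sklearn.cluster import KMeans

X = df[['A', 'B', 'C']].values
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# predict the cluster of the input row, then take every row with the
# same label as its cohort
input_cluster = kmeans.predict([[0.81, 0.75, 0.17]])[0]
cohorts = df[kmeans.labels_ == input_cluster]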
In some fields, like marketing, you'll also find methods that create clusters (cohorts) from numerical attributes using descriptive statistics.
The best example is the RFM segmentation method, which is a really smart way of doing clustering while keeping the resulting clusters highly intelligible: https://towardsdatascience.com/know-your-customers-with-rfm-9f88f09433bc
Using this approach you build your features on your entire set of data and then derive the resulting segments from the feature values.
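A rough sketch of that idea transferred to the example factors: this simply mimics RFM scoring by binning each factor into quartiles with pd.qcut, and the segment labels are my own construction for illustration (with only seven rows the bins are not meaningful).

import pandas as pd

# score each factor 1-4 by quartile, in the spirit of R/F/M scores
scores = df[['A', 'B', 'C']].apply(lambda col: pd.qcut(col, 4, labels=False) + 1)

# combine the per-factor scores into a segment label such as "431";
# rows sharing the input row's label would be its cohort
df['segment'] = (scores['A'].astype(str)
                 + scores['B'].astype(str)
                 + scores['C'].astype(str))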
As you do not really know the number of clusters (cohorts) or their structure, I believe the OPTICS algorithm would suit you best. It finds a group of points that are packed together (using Euclidean distance) and expands from them to build a cluster. It is then easy to find the cluster a new point belongs (or does not belong) to. It is similar to DBSCAN, but does not assume similar density across clusters. The sklearn library includes an implementation of OPTICS.
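A minimal sketch with sklearn's OPTICS (min_samples=2 is only to make the tiny example work; OPTICS has no predict method, so assigning a new point to the cluster of its nearest non-noise neighbour is my own workaround, not part of the sklearn API):

import numpy as np
from sklearn.cluster import OPTICS

X = df[['A', 'B', 'C']].to_numpy()
labels = OPTICS(min_samples=2).fit_predict(X)   # -1 marks noise points

# assign the input to the cluster of its nearest clustered (non-noise) neighbour
new_point = np.array([0.81, 0.75, 0.17])
dists = np.linalg.norm(X - new_point, axis=1)
nearest = np.argmin(np.where(labels == -1, np.inf, dists))
cohorts = df[labels == labels[nearest]]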
My understanding is that you want the distance from each column to be independently accounted for, but collected together in the final result.
To get that independent accounting, you can measure how different the members of a column are by using its standard deviation σ.
To collect together the final result, you can filter your dataframe iteratively, removing rows which are outside the wanted range. This also successively reduces the processing time, though it'll be negligible unless you have a great deal of data.
If adding your fourth column causes no data to be sufficiently close, this could indicate that:
- your input genuinely has no close matches in the data,
- that column's values are not normally distributed (you can check with scipy.stats.normaltest), or
- that column follows some other distribution entirely.
If the second or third is the case, you should not use the normal standard deviation, but one from another distribution.
However, if your data is seemingly random, you can apply some factor and/or power of the standard deviation in each column (i.e. use the variance) to get more- or less-accurate results.
initial dataframes
starting data (df)
PID A B C
508.0 0.85 0.51 0.05
400.0 0.97 0.61 0.3
251.0 0.01 0.97 0.29
414.0 0.25 0.04 0.83
706.0 0.37 0.32 0.33
65.0 0.78 0.62 0.25
533.0 0.24 0.15 0.88
test data (test_data)
A B C
0 0.81 0.75 0.17
Find the standard deviation of each column with df.std() and collect it into a new variable, then assemble another dataframe with it:
stdv = df.std()
PID
A 0.367145
B 0.316965
C 0.312219
# build a band of one standard deviation below and above the input values
test_df = pd.concat([test_data - stdv, test_data + stdv])
test_df.index = ["low", "high"]
test_df
A B C
low 0.442855 0.433035 -0.142219
high 1.177145 1.066965 0.482219
Results
Iterate over the columns, filtering out the rows outside the wanted range (pandas Series.between() can do this for you!):
for x in df:
    # keep only the rows whose value in this column lies inside the band
    df = df[df[x].between(test_df[x]["low"], test_df[x]["high"])]
resulting df
PID A B C
508.0 0.85 0.51 0.05
400.0 0.97 0.61 0.3
65.0 0.78 0.62 0.25
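To make the band wider or narrower, as mentioned above (applying some factor to the standard deviation), you could scale stdv before building the bounds; the 1.5 here is an arbitrary choice.

K = 1.5   # arbitrary factor; K < 1 tightens the band, K > 1 widens it
test_df = pd.concat([test_data - K * stdv, test_data + K * stdv])
test_df.index = ["low", "high"]

for x in ['A', 'B', 'C']:
    df = df[df[x].between(test_df[x]["low"], test_df[x]["high"])]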