Filtering non-'cohorts' from dataset

I am sure this topic has been researched before, but I am not sure what it is called or what techniques I should look into, hence why I am here. I am running this mainly in Python and pandas, but it is not limited to those languages/technologies.

As an example, let's pretend I have this dataset:

| PID | A    | B    | C    |
| --- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 |
| 400 | 0.97 | 0.61 | 0.30 |
| 251 | 0.01 | 0.97 | 0.29 |
| 414 | 0.25 | 0.04 | 0.83 |
| 706 | 0.37 | 0.32 | 0.33 |
| 65  | 0.78 | 0.62 | 0.25 |
| 533 | 0.24 | 0.15 | 0.88 |

PID is a unique ID for each row. A, B and C are some factors (normalized for this example). This dataset could be players in a sports league over history, products in an inventory, or voter data. The specific context isn't important.

Now let's say I have some input data:

| A    | B    | C    |
| ---- | ---- | ---- |
| 0.81 | 0.75 | 0.17 |

This input shares the same factors as the original dataset (A, B, C). What I want to do is to find the rows that are similar to my input data (the "cohorts"). What is the best way to approach this?

I thought of clustering, using a kNN algorithm, but the problem is that the number of cohorts is not fixed. You could have a unique input with few or no "cohorts", or a very common input with hundreds of "cohorts".

The next approach I tried was Euclidean distance. For this dataset and input I would do something like:

import pandas as pd

my_cols = ['A', 'B', 'C']

inputdata = pd.Series([0.81, 0.75, 0.17], index=my_cols)

# df = pandas DataFrame with the data above
# Euclidean distance from each row to the input
df['Dist'] = (df[my_cols] - inputdata).pow(2).sum(axis=1).pow(0.5)

This would create a new "Dist" column on the dataset:

| PID | A    | B    | C    | Dist |
| --- | ---- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 | 0.27 |
| 400 | 0.97 | 0.61 | 0.30 | 0.25 |
| 251 | 0.01 | 0.97 | 0.29 | 0.84 |
| 414 | 0.25 | 0.04 | 0.83 | 1.12 |
| 706 | 0.37 | 0.32 | 0.33 | 0.63 |
| 65  | 0.78 | 0.62 | 0.25 | 0.16 |
| 533 | 0.24 | 0.15 | 0.88 | 1.09 |

You can then keep only the rows whose distance falls below some threshold:

cohorts = df[df['Dist'] <= THRESHOLD]

The issues then become: (1) how do you determine the best threshold? And (2) if I add a fourth factor ("D") to the dataset and the Euclidean calculation, it seems to "break" the results, in that the cohorts no longer make intuitive sense when I look at them.

So my question is: what techniques or better approaches are there for filtering/selecting "cohorts" (the rows similar to an input row)?

Thank you

asked by Reily Bourne, Oct 02 '20

4 Answers

Here is an approach I came up with through logical thinking and some basic statistics. It compares the row-wise mean of your data with the mean of your input row and keeps the closest matches that fall within one standard deviation, using pd.merge_asof:

factors = ['A', 'B', 'C']

# row-wise mean of the factors, for both the data and the one-row input DataFrame
df = df.assign(avg=df[factors].mean(axis=1)).sort_values('avg')
input_data = input_data.assign(avg=input_data[factors].mean(axis=1)).sort_values('avg')

dfn = pd.merge_asof(
    df,
    input_data,
    on='avg',
    direction='nearest',
    tolerance=df['avg'].std()
)
Result:

   PID   A_x   B_x   C_x       avg   A_y   B_y   C_y
0  706  0.37  0.32  0.33  0.340000   NaN   NaN   NaN
1  414  0.25  0.04  0.83  0.373333   NaN   NaN   NaN
2  251  0.01  0.97  0.29  0.423333   NaN   NaN   NaN
3  533  0.24  0.15  0.88  0.423333   NaN   NaN   NaN
4  508  0.85  0.51  0.05  0.470000   NaN   NaN   NaN
5   65  0.78  0.62  0.25  0.550000  0.81  0.75  0.17
6  400  0.97  0.61  0.30  0.626667  0.81  0.75  0.17
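
Not part of the original answer, but a natural follow-up: rows whose input columns are not NaN are the ones that matched within the tolerance, so the cohorts can be pulled out with a simple filter:

# rows where the merge found the input within one standard deviation of 'avg'
cohorts = dfn[dfn['A_y'].notna()]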
answered by Erfan, Oct 16 '22


You're facing a clustering problem, so your K-means intuition was right.

clustering

But, as you mentioned, K-means is a parametric approach, so you need to determine the right K. There is an automated way of finding the best K with respect to cluster quality (shape, stability, homogeneity), called the elbow method: https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
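
As a rough sketch of the elbow idea with plain scikit-learn (my addition, not from the linked yellowbrick docs; it assumes the factors are in a DataFrame df with columns A, B and C):

from sklearn.cluster import KMeans

X = df[['A', 'B', 'C']].values
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# plot k against inertia; the "elbow" where the curve flattens is a reasonable K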

Then, you can use another clustering approach (in fact the right clustering algorithm depends on the meaning of your features); for example, you can use a density-based approach such as DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
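
A minimal DBSCAN sketch (my addition; eps and min_samples are illustrative and would need tuning, and input_data is assumed to be a one-row DataFrame with the same A, B, C columns). Since DBSCAN has no separate predict step, one option is to cluster the historical rows together with the input row and keep whatever lands in the same cluster:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([df[['A', 'B', 'C']].values,
               input_data[['A', 'B', 'C']].values])   # last row is the input
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)
# label -1 means the input was left as noise, i.e. no cohort was found
cohorts = df[labels[:-1] == labels[-1]] if labels[-1] != -1 else df.iloc[0:0]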

Thus, you'll need to identify the best clustering algorithm depending on your problem: https://machinelearningmastery.com/clustering-algorithms-with-python/

With this solution you'll fit your clustering algorithm on your training set (the one you call the "cohort" set), and then use the model to predict the cluster of your "non-cohort" samples.
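
With K-means specifically, the fit-then-predict workflow described above could look roughly like this (my sketch; n_clusters=3 is illustrative and would come from the elbow method, and input_data is again assumed to be a one-row DataFrame):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[['A', 'B', 'C']])
input_cluster = km.predict(input_data[['A', 'B', 'C']])[0]
cohorts = df[km.labels_ == input_cluster]   # rows sharing the input's cluster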

statistical cohorts

In some fields, like marketing, you'll also find methods for creating clusters (cohorts) from numerical attributes using descriptive statistics.

The best example is the RFM segmentation method, which is a really smart way of clustering while keeping the resulting clusters highly interpretable: https://towardsdatascience.com/know-your-customers-with-rfm-9f88f09433bc

Using this approach you'll build your features on your entire set of data, and then derive the resulting segments from the feature values.
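
As a loose sketch of that kind of segmentation in pandas (my addition, not the full RFM method): score each factor into terciles with pd.qcut and treat rows that share the input's combined score as one segment.

import pandas as pd

scored = df.copy()
for col in ['A', 'B', 'C']:
    # 1 = lowest third, 3 = highest third of that factor
    scored[col + '_score'] = pd.qcut(scored[col], q=3, labels=[1, 2, 3])

scored['segment'] = (scored['A_score'].astype(str)
                     + scored['B_score'].astype(str)
                     + scored['C_score'].astype(str))
# rows sharing a segment string form a cohort; the input row would be
# scored against the same bin edges to find which segment it falls in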

answered by fpajot, Oct 16 '22


As you do not really know the number of clusters (cohorts) or their structure, I believe the OPTICS algorithm would suit you best. It finds a group of points that are packed together (using Euclidean distance) and expands from them to build a cluster. It is then easy to find the cluster a new point belongs to (or doesn't). It is similar to DBSCAN, but does not assume similar density across clusters. The sklearn library includes an implementation of OPTICS.
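
A minimal OPTICS sketch (my addition; min_samples is illustrative, and input_data is assumed to be a one-row DataFrame with columns A, B, C). As with DBSCAN, the input row is clustered together with the data and its cluster members are the cohort:

import numpy as np
from sklearn.cluster import OPTICS

X = np.vstack([df[['A', 'B', 'C']].values,
               input_data[['A', 'B', 'C']].values])   # last row is the input
labels = OPTICS(min_samples=2).fit_predict(X)
# -1 means the input was left unclustered (no cohort)
cohorts = df[labels[:-1] == labels[-1]] if labels[-1] != -1 else df.iloc[0:0]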

answered by igrinis, Oct 16 '22


My understanding is that you want the distance in each column to be accounted for independently, but combined in the final result.

To get that independent accounting, you can measure how different the members of a column are by using its standard deviation σ.

To combine everything into the final result, you can filter your dataframe iteratively, removing rows that fall outside the wanted range. Each pass also shrinks the data left to process, though the savings are negligible unless you have a great deal of data.

If adding your fourth column causes no data to be sufficiently close, this could indicate

  • your test data is really not close to any of the source data and is a unique entry
  • your data is not normally distributed (if more data is available, you can test this with scipy.stats.normaltest; see the sketch below)
  • your columns are not independent (i.e. they need more specialized statistical handling)

If the second or third is the case, you should not use the normal standard deviation, but the equivalent spread measure from another distribution (along with the appropriate tests).
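
For the normality check mentioned in the list above, a quick sketch (my addition; "D" is the hypothetical fourth column, and scipy's normaltest needs a reasonable sample size, roughly 8+ rows):

from scipy.stats import normaltest

stat, p = normaltest(df['D'])   # D is the hypothetical fourth factor
if p < 0.05:
    print("column D does not look normally distributed")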

However, if your data is seemingly random, you can scale the standard deviation of each column by some factor and/or power (e.g. use the variance) to make the results more or less strict (see the tweak after the worked example below).


initial dataframes

starting data (df)

PID       A     B     C
508.0  0.85  0.51  0.05
400.0  0.97  0.61   0.3
251.0  0.01  0.97  0.29
414.0  0.25  0.04  0.83
706.0  0.37  0.32  0.33
65.0   0.78  0.62  0.25
533.0  0.24  0.15  0.88

test data (test_data)

      A     B     C
0  0.81  0.75  0.17

df.std()

find the standard deviation of each column and collect it into a Series

then assemble a low/high range dataframe from it

stdv = df.std()

PID
A    0.367145
B    0.316965
C    0.312219

test_df = pd.concat([test_data - stdv, test_data + stdv])
test_df.index = ["low", "high"]

test_df

             A         B         C
low   0.442855  0.433035 -0.142219
high  1.177145  1.066965  0.482219

Results

iterate over the columns, filtering out those outside the wanted range (pandas Series.between() can do this for you!)

# keep only the rows that fall inside [low, high] for every column
for x in df:
    df = df[df[x].between(test_df[x]["low"], test_df[x]["high"])]

resulting df

PID       A     B     C
508.0  0.85  0.51  0.05
400.0  0.97  0.61   0.3
65.0   0.78  0.62  0.25
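
If you want the acceptance band tighter or looser, as mentioned above, the same steps can be run with a scaled standard deviation (my addition; the 0.5 factor is purely illustrative):

scale = 0.5   # < 1 tightens the band, > 1 loosens it
# assumes df still holds the full, unfiltered data
test_df = pd.concat([test_data - scale * stdv, test_data + scale * stdv])
test_df.index = ["low", "high"]

filtered = df.copy()
for x in filtered:
    filtered = filtered[filtered[x].between(test_df[x]["low"], test_df[x]["high"])]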
answered by ti7, Oct 16 '22