I am sure this topic has been researched before, but I am not sure what it is called or what techniques I should look into, hence why I am here. I am running this mainly in Python and Pandas, but it is not limited to those languages/technologies.
As an example, let's pretend I have this dataset:
| PID | A | B | C |
| --- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 |
| 400 | 0.97 | 0.61 | 0.30 |
| 251 | 0.01 | 0.97 | 0.29 |
| 414 | 0.25 | 0.04 | 0.83 |
| 706 | 0.37 | 0.32 | 0.33 |
| 65 | 0.78 | 0.62 | 0.25 |
| 533 | 0.24 | 0.15 | 0.88 |
PID is a unique ID for that row. A, B and C are some factors (normalized for this example). This dataset could be players in a sports league over history, it could be products in an inventory, it could be voter data. The specific context isn't important.
Now let's say I have some input data:
| A | B | C |
| ---- | ---- | ---- |
| 0.81 | 0.75 | 0.17 |
This input shares the same factors as the original dataset (A, B, C). What I want to do is to find the rows that are similar to my input data (the "cohorts"). What is the best way to approach this?
I thought of clustering, using a kNN algorithm, but the problem is that the number of cohorts is not fixed. You could have a unique input with few or no "cohorts", or you could have a very common input with hundreds of "cohorts".
The solution I next tried was Euclidean Distance. So for this dataset and input I would do something like:
import pandas as pd

my_cols = ['A', 'B', 'C']
inputdata = pd.Series([0.81, 0.75, 0.17], index=my_cols)
# df = pandas data frame with above data
df['Dist'] = (df[my_cols] - inputdata).pow(2).sum(axis=1).pow(0.5)
This would create a new column on the dataset like:
| PID | A | B | C | Dist |
| --- | ---- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 | 0.27 |
| 400 | 0.97 | 0.61 | 0.30 | 0.25 |
| 251 | 0.01 | 0.97 | 0.29 | 0.84 |
| 414 | 0.25 | 0.04 | 0.83 | 1.12 |
| 706 | 0.37 | 0.32 | 0.33 | 0.63 |
| 65 | 0.78 | 0.62 | 0.25 | 0.16 |
| 533 | 0.24 | 0.15 | 0.88 | 1.09 |
You can then "filter" out those rows below some threshold.
cohorts = df[df['Dist'] <= THRESHOLD]
The issue then becomes: (1) how do you determine the best threshold? And (2) if I add a fourth factor ("D") to the dataset and the Euclidean calculation, it seems to "break" the results, in that the cohorts no longer make intuitive sense.
So my question is: what are techniques or better ways to filter/select "cohorts" (those rows similar to an input row)?
Thank you
Here is an algorithm I came up with myself through logical thinking and some basic statistics. It uses the mean of each row's values and the mean of your input data to find the closest matches within one standard deviation, using pd.merge_asof:
factors = ['A', 'B', 'C']

# average the factors per row and sort, since merge_asof needs sorted keys
df = df.assign(avg=df[factors].mean(axis=1)).sort_values('avg')
input_data = input_data.assign(avg=input_data[factors].mean(axis=1)).sort_values('avg')

# match each row to the input whose average is nearest, but only if it lies
# within one standard deviation of the row averages
dfn = pd.merge_asof(
    df,
    input_data,
    on='avg',
    direction='nearest',
    tolerance=df['avg'].std()
)
PID A_x B_x C_x avg A_y B_y C_y
0 706 0.37 0.32 0.33 0.340000 NaN NaN NaN
1 414 0.25 0.04 0.83 0.373333 NaN NaN NaN
2 251 0.01 0.97 0.29 0.423333 NaN NaN NaN
3 533 0.24 0.15 0.88 0.423333 NaN NaN NaN
4 508 0.85 0.51 0.05 0.470000 NaN NaN NaN
5 65 0.78 0.62 0.25 0.550000 0.81 0.75 0.17
6 400 0.97 0.61 0.30 0.626667 0.81 0.75 0.17
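If you only want the matching rows, one small follow-up sketch (my own addition) is to drop the unmatched ones from the merged frame dfn above: rows outside the tolerance received NaN in the merged input columns (A_y, B_y, C_y), so removing those leaves the cohort.

# rows outside the tolerance got NaN in the merged input columns
cohorts = dfn.dropna(subset=['A_y'])
print(cohorts[['PID', 'A_x', 'B_x', 'C_x']])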
You're facing a clustering problem, so your K-means intuition was right.
But, as you mentioned, K-means is a parametric approach, so you need to determine the right K. There is an automated way of finding the best K with respect to cluster quality (shape, stability, homogeneity), known as the elbow method: https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
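A minimal sketch of the elbow method using the yellowbrick package from that link (the k range of 2-10 is an arbitrary choice and assumes a realistically sized dataset, not the seven-row example above):

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

X = df[['A', 'B', 'C']].values

# try K = 2..10 and let the visualizer mark the "elbow", i.e. the K where
# adding more clusters stops noticeably reducing the distortion
visualizer = KElbowVisualizer(KMeans(n_init=10), k=(2, 10))
visualizer.fit(X)
visualizer.show()

best_k = visualizer.elbow_value_   # the suggested K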
Then, you can use another clustering approach (in fact, the right clustering algorithm depends on the meaning of your features); for example, you can use a density-based approach with DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
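A minimal DBSCAN sketch. DBSCAN has no predict method, so clustering the input row together with the data and reading off which rows share its label is my own workaround, not anything official; eps and min_samples are placeholders you would tune.

import pandas as pd
from sklearn.cluster import DBSCAN

# stack the input row (the Series from the question) onto the factor columns
X = pd.concat([df[['A', 'B', 'C']], inputdata.to_frame().T], ignore_index=True)

labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)

input_label = labels[-1]                      # cluster assigned to the input row
if input_label == -1:
    cohorts = df.iloc[0:0]                    # input was flagged as noise: no cohort
else:
    cohorts = df[labels[:-1] == input_label]  # rows in the same cluster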
Thus, you'll need to identify the best clustering algorithm depending on your problem: https://machinelearningmastery.com/clustering-algorithms-with-python/
With this solution you'll fit your clustering algorithm on your training set (the one you call the "cohort" set), and then use the model to predict the cluster of your "non-cohort" samples.
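For instance with K-means (n_clusters=3 is just a placeholder; in practice you would use the K suggested by the elbow method above):

from sklearn.cluster import KMeans

X = df[['A', 'B', 'C']].values
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# predict the cluster of the input row, then take every row with the
# same label as its cohort
input_cluster = kmeans.predict([[0.81, 0.75, 0.17]])[0]
cohorts = df[kmeans.labels_ == input_cluster]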
In some fields, like marketing, you'll also find methods that create clusters (cohorts) from numerical attributes using descriptive statistics.
The best example is the RFM segmentation method, which is a really smart way of doing clustering while keeping the resulting clusters highly intelligible: https://towardsdatascience.com/know-your-customers-with-rfm-9f88f09433bc
Using this approach you build your features on your entire set of data and then derive the resulting segments from the feature values.
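A rough sketch of that idea transferred to the example factors: this simply mimics RFM scoring by binning each factor into quartiles with pd.qcut, and the segment labels are my own construction for illustration (with only seven rows the bins are not meaningful).

import pandas as pd

# score each factor 1-4 by quartile, in the spirit of R/F/M scores
scores = df[['A', 'B', 'C']].apply(lambda col: pd.qcut(col, 4, labels=False) + 1)

# combine the per-factor scores into a segment label such as "431";
# rows sharing the input row's label would be its cohort
df['segment'] = (scores['A'].astype(str)
                 + scores['B'].astype(str)
                 + scores['C'].astype(str))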
As you do not really know the number of clusters (cohorts) or their structure, I believe the OPTICS algorithm would suit you best. It finds a group of points that are packed together (using Euclidean distance) and expands from them to build a cluster. It is then easy to find the cluster a new point belongs (or does not belong) to. It is similar to DBSCAN, but does not assume similar density across clusters. The sklearn library includes an implementation of OPTICS.
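A minimal sketch with sklearn's OPTICS (min_samples=2 is only to make the tiny example work; OPTICS has no predict method, so assigning a new point to the cluster of its nearest non-noise neighbour is my own workaround, not part of the sklearn API):

import numpy as np
from sklearn.cluster import OPTICS

X = df[['A', 'B', 'C']].to_numpy()
labels = OPTICS(min_samples=2).fit_predict(X)   # -1 marks noise points

# assign the input to the cluster of its nearest clustered (non-noise) neighbour
new_point = np.array([0.81, 0.75, 0.17])
dists = np.linalg.norm(X - new_point, axis=1)
nearest = np.argmin(np.where(labels == -1, np.inf, dists))
cohorts = df[labels == labels[nearest]]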
My understanding is that you want the distance from each column to be independently accounted for, but collected together in the final result.
To get that independent accounting, you can measure how different the members of a column are by using its standard deviation σ.
To collect together the final result, you can filter your dataframe iteratively, removing rows which are outside the wanted range. This also successively reduces the processing time, though it'll be negligible unless you have a great deal of data.
If adding your fourth column causes no data to be sufficiently close, this could indicate that:
- your input genuinely has no close matches in the data,
- that column's values are not normally distributed (you can check with scipy.stats.normaltest), or
- that column follows some other distribution entirely.
If the second or third is the case, you should not use the normal standard deviation, but one from another distribution.
However, if your data is seemingly random, you can apply some factor and/or power of the standard deviation in each column (i.e. use the variance) to get more- or less-accurate results.
initial dataframes
starting data (df)
PID A B C
508.0 0.85 0.51 0.05
400.0 0.97 0.61 0.3
251.0 0.01 0.97 0.29
414.0 0.25 0.04 0.83
706.0 0.37 0.32 0.33
65.0 0.78 0.62 0.25
533.0 0.24 0.15 0.88
test data (test_data)
A B C
0 0.81 0.75 0.17
Find the standard deviation of each column with df.std() and collect it into a new variable, then assemble another dataframe with it:
stdv = df.std()
PID
A 0.367145
B 0.316965
C 0.312219
# build a band of one standard deviation below and above the input values
test_df = pd.concat([test_data - stdv, test_data + stdv])
test_df.index = ["low", "high"]
test_df
A B C
low 0.442855 0.433035 -0.142219
high 1.177145 1.066965 0.482219
Results
Iterate over the columns, filtering out the rows outside the wanted range (pandas Series.between() can do this for you!):
for x in df:
    # keep only the rows whose value in this column lies inside the band
    df = df[df[x].between(test_df[x]["low"], test_df[x]["high"])]
resulting df
PID A B C
508.0 0.85 0.51 0.05
400.0 0.97 0.61 0.3
65.0 0.78 0.62 0.25
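To make the band wider or narrower, as mentioned above (applying some factor to the standard deviation), you could scale stdv before building the bounds; the 1.5 here is an arbitrary choice.

K = 1.5   # arbitrary factor; K < 1 tightens the band, K > 1 widens it
test_df = pd.concat([test_data - K * stdv, test_data + K * stdv])
test_df.index = ["low", "high"]

for x in ['A', 'B', 'C']:
    df = df[df[x].between(test_df[x]["low"], test_df[x]["high"])]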