I have the following pandas dataframe -
    Atomic Number      R         C
0             2.0   49.0  0.040306
1             3.0  205.0  0.209556
2             4.0  140.0  0.107296
3             5.0  117.0  0.124688
4             6.0   92.0  0.100020
5             7.0   75.0  0.068493
6             8.0   66.0  0.082244
7             9.0   57.0  0.071332
8            10.0   51.0  0.045725
9            11.0  223.0  0.217770
10           12.0  172.0  0.130719
11           13.0  182.0  0.179953
12           14.0  148.0  0.147929
13           15.0  123.0  0.102669
14           16.0  110.0  0.120729
15           17.0   98.0  0.106872
16           18.0   88.0  0.061996
17           19.0  277.0  0.260485
18           20.0  223.0  0.164312
19           33.0  133.0  0.111359
20           36.0  103.0  0.069348
21           37.0  298.0  0.270709
22           38.0  245.0  0.177368
23           54.0  124.0  0.079491
The trend between R and C is generally linear. What I would like to do, if possible, is build an exhaustive list of all possible combinations of 3 or more points together with their trends from scipy.stats.linregress, so that I can find the groups of points that best fit a line.
Ideally that would look something like this for the data (Source), but I am looking for all the other possible trends too.
So the question: how do I feed all 16,776,915 possible combinations (sum_(i=3)^24 binomial(24, i)) of 3 or more points into linregress, and is it even doable without a ton of code?
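For a handful of points, I assume the brute-force version of this would look roughly like the sketch below (df being the dataframe above; names purely illustrative). The problem is that looping over all ~16.8 million subsets this way is clearly going to be very slow:
from itertools import combinations

from scipy.stats import linregress

# brute-force sketch only: in practice you would cap the subset size
# or sample subsets instead of enumerating all of them
results = []
for size in range(3, len(df) + 1):
    for idx in combinations(df.index, size):
        subset = df.loc[list(idx)]
        fit = linregress(subset["R"], subset["C"])
        results.append((idx, fit.slope, fit.intercept, fit.rvalue ** 2))

# sort the subsets by how well they fit a line (highest R^2 first)
results.sort(key=lambda t: t[3], reverse=True)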
My proposed solution below is based on the RANSAC algorithm, a method for fitting a mathematical model (e.g. a line) to data that contains many outliers.
RANSAC is one specific method from the field of robust regression.
My solution below first fits a line with RANSAC. You then remove the data points close to this line from the data set (which is the same as keeping only the outliers), fit RANSAC again, remove data again, and so on, until only very few points are left.
Such approaches always have parameters which are data dependent (e.g. the noise level or the proximity of the lines). In the solution below, MIN_SAMPLES and residual_threshold are parameters which might require some adaptation to the structure of your data:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

MIN_SAMPLES = 3

x = np.linspace(0, 2, 100)

xs, ys = [], []

# generate points for three lines described by slope a and intercept b,
# we also add some noise:
for a, b in [(1.0, 2), (0.5, 1), (1.2, -1)]:
    xs.extend(x)
    ys.extend(a * x + b + .1 * np.random.randn(len(x)))

xs = np.array(xs)
ys = np.array(ys)
plt.plot(xs, ys, "r.")

colors = "rgbky"
idx = 0

while len(xs) > MIN_SAMPLES:
    # build design matrix for the linear regressor
    X = np.ones((len(xs), 2))
    X[:, 1] = xs

    ransac = linear_model.RANSACRegressor(
        residual_threshold=.3, min_samples=MIN_SAMPLES
    )
    ransac.fit(X, ys)

    # vector of boolean values, describes which points belong
    # to the fitted line:
    inlier_mask = ransac.inlier_mask_

    # plot the inliers of this iteration:
    xinlier = xs[inlier_mask]
    yinlier = ys[inlier_mask]

    # cycle through colors:
    color = colors[idx % len(colors)]
    idx += 1

    plt.plot(xinlier, yinlier, color + "*")

    # only keep the outliers for the next iteration:
    xs = xs[~inlier_mask]
    ys = ys[~inlier_mask]

plt.show()
In the resulting plot, points shown as stars belong to the clusters detected by my code. You also see a few points depicted as circles, which are the points remaining after the iterations. The few black stars form a cluster which you could get rid of by increasing MIN_SAMPLES and/or residual_threshold.
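Applied to your data, the same loop would look roughly like the sketch below. The column names and the residual_threshold value are assumptions you will likely need to adjust (your C values only span about 0.04 to 0.27). I also run scipy.stats.linregress on every detected inlier group, so you get the slope, intercept and r value of each trend, which is what your question asks for:
import numpy as np
from scipy.stats import linregress
from sklearn import linear_model

MIN_SAMPLES = 3

# assuming df is the dataframe from the question with columns "R" and "C"
xs = df["R"].to_numpy(dtype=float)
ys = df["C"].to_numpy(dtype=float)

while len(xs) > MIN_SAMPLES:
    # design matrix with an intercept column, as above
    X = np.ones((len(xs), 2))
    X[:, 1] = xs

    # residual_threshold=0.02 is a guess based on the scale of C;
    # adjust it to how much scatter you accept within one trend
    ransac = linear_model.RANSACRegressor(
        residual_threshold=0.02, min_samples=MIN_SAMPLES
    )
    ransac.fit(X, ys)
    inlier_mask = ransac.inlier_mask_

    if inlier_mask.sum() < MIN_SAMPLES:
        break

    # report the trend of this group with linregress
    fit = linregress(xs[inlier_mask], ys[inlier_mask])
    print(f"{inlier_mask.sum()} points: slope={fit.slope:.5f}, "
          f"intercept={fit.intercept:.5f}, r={fit.rvalue:.3f}")

    # keep only the outliers for the next round
    xs = xs[~inlier_mask]
    ys = ys[~inlier_mask]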