Python: Finding multiple linear trend lines in a scatter plot

I have the following pandas dataframe -

    Atomic Number      R         C
0             2.0   49.0  0.040306
1             3.0  205.0  0.209556
2             4.0  140.0  0.107296
3             5.0  117.0  0.124688
4             6.0   92.0  0.100020
5             7.0   75.0  0.068493
6             8.0   66.0  0.082244
7             9.0   57.0  0.071332
8            10.0   51.0  0.045725
9            11.0  223.0  0.217770
10           12.0  172.0  0.130719
11           13.0  182.0  0.179953
12           14.0  148.0  0.147929
13           15.0  123.0  0.102669
14           16.0  110.0  0.120729
15           17.0   98.0  0.106872
16           18.0   88.0  0.061996
17           19.0  277.0  0.260485
18           20.0  223.0  0.164312
19           33.0  133.0  0.111359
20           36.0  103.0  0.069348
21           37.0  298.0  0.270709
22           38.0  245.0  0.177368
23           54.0  124.0  0.079491

The trend between R and C is generally linear. If possible, I would like to build an exhaustive list of all combinations of 3 or more points and fit each one with scipy.stats.linregress, so that I can find the groups of points that best fit a straight line.

Ideally, the result for this data would look something like this (Source), but I am looking for all the other possible trends too.

So the question is: how do I feed all 16776915 possible combinations (sum_(i=3)^24 binomial(24, i)) of 3 or more points into linregress, and is it even doable without a ton of code?

Pyrphoros asked Sep 16 '25 19:09


1 Answer

My proposed solution is based on the RANSAC algorithm, a method for fitting a mathematical model (e.g. a line) to data that contains many outliers.

RANSAC is one specific method from the field of robust regression.
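As a minimal illustration of what a single RANSAC fit returns, the sketch below fits one line to synthetic data and reads back the fitted line and the inlier mask. The data, the threshold of 0.5 and the fixed random seeds are assumptions chosen only for this example.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 60 points on y = 2x + 1 with
# small noise; every 6th point (10 in total) is shifted far above the line.
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + 0.05 * rng.standard_normal(x.size)
outlier_idx = np.arange(0, 60, 6)
y[outlier_idx] += 20.0

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features).
X = x.reshape(-1, 1)

# The default base estimator is an ordinary least-squares line.
ransac = RANSACRegressor(residual_threshold=0.5, random_state=0)
ransac.fit(X, y)

slope = ransac.estimator_.coef_[0]
intercept = ransac.estimator_.intercept_
inlier_mask = ransac.inlier_mask_  # True for points close to the fitted line

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"{inlier_mask.sum()} inliers, {(~inlier_mask).sum()} outliers")
```

The boolean inlier_mask_ is what makes the iterative approach possible: it separates the points explained by the current line from the points that still need a line of their own.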

My solution below first fits a line with RANSAC. You then remove the data points close to this line from your data set (which is the same as keeping the outliers), fit RANSAC again, remove data again, and so on, until only very few points are left.

Such approaches always have parameters that depend on the data (e.g. the noise level or how close the lines are to each other). In the following solution, MIN_SAMPLES and residual_threshold are the parameters that might need some adaptation to the structure of your data:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

MIN_SAMPLES = 3

x = np.linspace(0, 2, 100)

xs, ys = [], []

# generate points for three lines, each described by slope a and intercept b,
# and add some Gaussian noise:
for a, b in [(1.0, 2), (0.5, 1), (1.2, -1)]:
    xs.extend(x)
    ys.extend(a * x + b + .1 * np.random.randn(len(x)))

xs = np.array(xs)
ys = np.array(ys)
plt.plot(xs, ys, "r.")

colors = "rgbky"
idx = 0

while len(xs) > MIN_SAMPLES:

    # scikit-learn expects a 2-D feature array; the default base estimator
    # (LinearRegression) fits the intercept itself, so no column of ones
    # is needed
    X = xs.reshape(-1, 1)

    ransac = linear_model.RANSACRegressor(
        residual_threshold=.3, min_samples=MIN_SAMPLES
    )

    ransac.fit(X, ys)

    # vector of boolean values, describes which points belong
    # to the fitted line:
    inlier_mask = ransac.inlier_mask_

    # plot point cloud:
    xinlier = xs[inlier_mask]
    yinlier = ys[inlier_mask]

    # cycle through colors:
    color = colors[idx % len(colors)]
    idx += 1
    plt.plot(xinlier, yinlier, color + "*")

    # only keep the outliers:
    xs = xs[~inlier_mask]
    ys = ys[~inlier_mask]

plt.show()

In the plot produced by this code, points shown as stars belong to the clusters detected by the code. The few points still visible as small red dots are the ones remaining after the last iteration. The few black stars form a cluster which you could get rid of by increasing MIN_SAMPLES and/or residual_threshold.
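Applied to the data from the question, the same loop could look like the sketch below. It keeps track of point indices instead of plotting, and reports each group's trend with scipy.stats.linregress as the question asks. The value of RESIDUAL_THRESHOLD (0.01, in units of C) and the fixed random_state are assumptions and will need tuning; the try/except and the size check are added so the loop stops cleanly when RANSAC finds no usable group.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import RANSACRegressor

# R and C columns copied from the data frame in the question.
R = np.array([49, 205, 140, 117, 92, 75, 66, 57, 51, 223, 172, 182,
              148, 123, 110, 98, 88, 277, 223, 133, 103, 298, 245, 124],
             dtype=float)
C = np.array([0.040306, 0.209556, 0.107296, 0.124688, 0.100020, 0.068493,
              0.082244, 0.071332, 0.045725, 0.217770, 0.130719, 0.179953,
              0.147929, 0.102669, 0.120729, 0.106872, 0.061996, 0.260485,
              0.164312, 0.111359, 0.069348, 0.270709, 0.177368, 0.079491])

MIN_SAMPLES = 3            # smallest group size of interest
RESIDUAL_THRESHOLD = 0.01  # assumed tolerance in units of C; tune for your data

remaining = np.arange(len(R))  # indices of points not yet assigned to a line
groups = []                    # list of (point indices, linregress result)

while len(remaining) >= MIN_SAMPLES:
    X = R[remaining].reshape(-1, 1)
    y = C[remaining]
    ransac = RANSACRegressor(residual_threshold=RESIDUAL_THRESHOLD,
                             min_samples=MIN_SAMPLES, random_state=0)
    try:
        ransac.fit(X, y)
    except ValueError:
        # RANSAC could not find a consensus set among the remaining points
        break

    inliers = remaining[ransac.inlier_mask_]
    if len(inliers) < MIN_SAMPLES:
        # the best line explains too few points to count as a group
        break

    # describe the group with an ordinary least-squares fit
    groups.append((inliers, linregress(R[inliers], C[inliers])))

    # only keep the outliers for the next round
    remaining = remaining[~ransac.inlier_mask_]

for inliers, fit in groups:
    print(f"points {inliers.tolist()}: slope={fit.slope:.3g}, "
          f"intercept={fit.intercept:.3g}, r={fit.rvalue:.3f}")
print("unassigned points:", remaining.tolist())
```

The printed indices refer to the row labels of the data frame in the question. Because RANSAC samples randomly, a different random_state or threshold can produce a different grouping, so it is worth trying several values.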

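For comparison, the brute-force enumeration described in the question does not need much code either. The sketch below checks the quoted count and runs linregress on every subset of 3 or 4 points. MAX_SIZE is an assumed cap that keeps the run short, and ranking by |r| is just one possible choice of quality measure.

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy.stats import linregress

# R and C columns copied from the data frame in the question.
R = np.array([49, 205, 140, 117, 92, 75, 66, 57, 51, 223, 172, 182,
              148, 123, 110, 98, 88, 277, 223, 133, 103, 298, 245, 124],
             dtype=float)
C = np.array([0.040306, 0.209556, 0.107296, 0.124688, 0.100020, 0.068493,
              0.082244, 0.071332, 0.045725, 0.217770, 0.130719, 0.179953,
              0.147929, 0.102669, 0.120729, 0.106872, 0.061996, 0.260485,
              0.164312, 0.111359, 0.069348, 0.270709, 0.177368, 0.079491])

n = len(R)

# Number of subsets with 3 or more points, as quoted in the question.
total = sum(comb(n, k) for k in range(3, n + 1))
print(total)  # 16776915

MAX_SIZE = 4  # assumed cap: sizes 3..4 give 2024 + 10626 fits

results = []
for k in range(3, MAX_SIZE + 1):
    for idx in combinations(range(n), k):
        idx = list(idx)
        fit = linregress(R[idx], C[idx])
        results.append((abs(fit.rvalue), idx, fit.slope, fit.intercept))

# strongest linear correlation first
results.sort(key=lambda t: t[0], reverse=True)
for abs_r, idx, slope, intercept in results[:5]:
    print(f"points {idx}: |r|={abs_r:.4f}, "
          f"slope={slope:.3g}, intercept={intercept:.3g}")
```

Raising MAX_SIZE to n would cover all subsets, but the number of fits grows quickly, so it is worth timing a small run first. Note also that small subsets tend to score well almost by construction (any three points that happen to be nearly collinear give |r| close to 1), which is one reason the RANSAC approach above groups points by a residual threshold instead.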

rocksportrocker answered Sep 18 '25 10:09