Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging approximately equal points in dataset

Tags:

c#

algorithm

I'm looking for an algorithm that can quickly run through a short (<30 element) array and merge points that are approximately equal. It'll probably end up being some sort of segmentation algorithm.

The context is as follows: I'm looking for the tallest peaks in a dataset. I've already separated the tallest maximums from the dross using a one-dimensional implementation of J-SEG, but anywhere where the dataset is "flat," I get back a point for every element along the plateau. I need to be able to adaptively merge these points to a single point at the center of the plateau. (This also means I don't know how many clusters there will be.)

Sample dataset 1 (Sample/Artificial input) Input:

97  54686024814922.8
118 406406320535.935
148 24095826539423.7
152 1625624905272.95
160 1625625128029.81
166 1625625152145.47
176 1625625104745.48
179 1625625127365.09
183 1625625152208.44
190 1625624974205.81
194 21068100428092.9
247 54686024895222.1

Ideal Output:

97  54686024814922.8
118 406406320535.935
148 24095826539423.7
159 1625625061816.08


182 1625625089631.21



194 21068100428092.9
247 54686024895222.1

Sample dataset 2 (Real input): Input:

2   196412376940671
123 206108518197124
135 194488685387149
148 178463949513298
154 192912098976702
156 195042451997727
161 195221254214493
168 204760073508681
172 189240741651297
182 191554457423846
187 215014126955355
201 202294866774063

Ideal output:

2   196412376940671
123 206108518197124
135 194488685387149
148 178463949513298
157 194391935062974


168 204760073508681
172 189240741651297
182 191554457423846
187 215014126955355
201 202294866774063

Sample Dataset 3 (Real input) Input:

2   299777367852602
26  263467434856928
35  293412234811901
83  242768805551742
104 226333969841383
107 227548774800053
178 229173574175201
181 229224441416751
204 244334414017228
206 245258151638118
239 198782930497571

Ideal output:

2       299777367852602
26      263467434856928 (May be merged
35      293412234811901 depending on parameters)
83      242768805551742
105.5   226941372320718

179.5   229199007795976

205     244796282827673

239     198782930497571

(Will edit in further information as needed.)

like image 554
linkhyrule5 Avatar asked Feb 10 '12 02:02

linkhyrule5


1 Answers

I'm not sure if this is exactly what you want, but there aren't any other answers posted yet so here we go.

I looked at it from the perspective of a graph. If I were looking at a graph and I wanted to determine which points were horizontally similar that would end up being relative to the graphs scale. So I made a function that accepts a percentage of the scale that you want to be considered the same. It then takes that percentage and multiplies it by the maximum difference between your dataset.

Additionally the similar value is always compared against the average of the currently located plateau. Once a a plateau is detected to end it adds the x's together and divides by 2 to get the middle and then takes the average y value and adds it as a final data point.

Without having access to good sample data all I have to go on is my very poor data generator that I made. But in my testing values within 1% generally eliminated about half of my data points.

Now it's important to note that this is one dimensional, x distance is completely ignored. You could very easily expand it to be two dimension as well. Also something else that I considered doing is instead of outputting a single data point to represent plateaus you could output a start and end point of the average instead.

namespace PointCondenser
{
    public static class Extensions
    {
        static public bool AlmostEqual<T>(this T value, T value2, T epsilon)
        {
            return (Math.Abs((dynamic)value - value2) < epsilon);
        }
    }

    public struct Point
    {
        public Point(double x, double y)
        {
            X = x;
            Y = y;
        }

        public override string ToString()
        {
            return string.Format("{0}\t{1}", X, Y);
        }

        public double X;
        public double Y;
    }
    class Program
    {
        static public Point RandomYPoint(int i)
        {
            var r = new Random();

            var r2 = new Random(i);

            var variance = r2.NextDouble() / 100;

            return new Point(i, Math.Abs(r.NextDouble() - variance) * 100);
        }

        static public IEnumerable<Point> SmoothPoints(IEnumerable<Point> points, double percent)
        {
            if (percent <= 0 || percent >= 1)
                throw new ArgumentOutOfRangeException("percent", "Percentage outside of logical bounds");

            var final = new List<Point>();

            var apoints = points.ToArray();

            var largestDifference = apoints.Max(x => x.Y) - apoints.Min(x => x.Y);
            var epsilon = largestDifference * percent;

            var currentPlateau = new List<Point> { apoints[0] };

            for (var i = 1; i < apoints.Length; ++i)
            {
                var point = apoints[i];
                if (point.Y.AlmostEqual(currentPlateau.Average(x => x.Y), epsilon))
                    currentPlateau.Add(point);
                else
                {
                    var x = (currentPlateau[0].X + currentPlateau[currentPlateau.Count - 1].X) / 2.0;
                    var y = currentPlateau.Average(z => z.Y);

                    currentPlateau.Clear();
                    currentPlateau.Add(point);

                    final.Add(new Point(x, y));
                }
            }

            return final;
        }

        static void Main(string[] args)
        {
            var r = new Random();
            var points = new List<Point>();


            for (var i = 0; i < 100; ++i)
            {
                for (var n = 0; n < r.Next(1, 5); ++n)
                {
                    var p = RandomYPoint(points.Count);
                    points.Add(p);
                    Console.WriteLine(p);
                }
                Thread.Sleep(r.Next(10, 250));
            }

            Console.Write("\n\n Condensed \n\n");


            var newPoints = SmoothPoints(points, .01);

            foreach (var p in newPoints)
                Console.WriteLine(p);
        }
    }
}
like image 166
iwalkbarefoot Avatar answered Oct 14 '22 01:10

iwalkbarefoot