Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding runs of similar, not identical, elements in Mathematica

I have a list of numbers. I want to extract from the list runs of numbers that fall inside some band and have some minimum length. For instance, suppose the I want to operate on this list:

thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}

with band=1 and runLength=3. I would like to have

{{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}}

as the result. Right now I'm using

Cases[
 Partition[thisList,runLength,1],
 x_ /; Abs[Max[x] - Min[x]] < band
]

The main problem is that where runs overlap, I get many copies of the run. For instance, using

thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}

gives me

{{-0.6, -0.8, -0.1}, {-0.8, -0.1, -0.5}, {-0.1, -0.5, -0.3}, {-0.5, -0.3, -0.1}, {-0.3, -0.1, -0.7}}

where I would rather have

{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}

without doing some hokey reduction of the overlapping result. What's the proper way? It'd be nice if it didn't involve exploding the data using Partition.

like image 858
ArgentoSapiens Avatar asked Dec 13 '11 19:12

ArgentoSapiens


2 Answers

EDIT

Apparenty, my first solution has at least two serious flaws: it is dead slow and completely impractical for lists larger than 100 elements, and it contains some bug(s) which I wasn't able to identify yet - it is missing some bands sometimes. So, I will provide two (hopefuly correct) and much more efficient alternatives, and I provide the flawed one below for any one interested.

A solution based on linked lists

Here is a solution based on linked lists. It allows us to still use patterns but avoid inefficiencies caused by patterns containing __ or ___ (when repeatedly applied):

ClearAll[toLinkedList];
toLinkedList[x_List] := Fold[{#2, #1} &, {}, Reverse@x]

ClearAll[accumF];
accumF[llFull_List, acc_List, {h_, t_List}, ctr_, max_, min_, band_, rLen_] :=
  With[{cmax = Max[max, h], cmin = Min[min, h]},
     accumF[llFull, {acc, h}, t, ctr + 1, cmax, cmin, band, rLen] /; 
        Abs[cmax - cmin] < band];
accumF[llFull_List, acc_List, ll : {h_, _List}, ctr_, _, _, band_, rLen_] /; ctr >= rLen :=
     accumF[ll, (Sow[acc]; {}), ll, 0, h, h, band, rLen];
accumF[llFull : {h_, t : {_, _List}}, _List, ll : {head_, _List}, _, _, _, band_, rLen_] :=
     accumF[t, {}, t, 0, First@t, First@t, band, rLen];
accumF[llFull_List, acc_List, {}, ctr_, _, _, _, rLen_] /; ctr >= rLen := Sow[acc];

ClearAll[getBandsLL];
getBandsLL[lst_List, runLength_Integer, band_?NumericQ] :=
  Block[{$IterationLimit = Infinity},
     With[{ll = toLinkedList@lst},
        Map[Flatten,
          If[# === {}, #, First@#] &@
            Reap[
              accumF[ll, {}, ll, 0, First@ll, First@ll, band,runLength]
            ][[2]]
        ]
     ]
  ];

Here are examples of use:

In[246]:= getBandsLL[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,1.4,-0.3,-0.1,-0.7},3,1]
Out[246]= {{-0.6,-0.8,-0.1},{-0.3,-0.1,-0.7}}

In[247]:= getBandsLL[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7},3,1]
Out[247]= {{-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7}}

The main idea of the function accumF is to traverse the number list (converted to a linked list prior to that), and accumulate a band in another linked list, which is passed to it as a second argument. Once the band condition fails, the accumulated band is memorized using Sow (if it was long enough), and the process starts over with the remaining part of the linked list. The ctr parameter may not be needed if we would choose to use Depth[acc] instead.

There are a few non-obvious things present in the above code. One subtle point is that an attempt to join the two middle rules for accumF into a single rule (they look very similar) and use CompoundExpression (something like (If[ctr>=rLen, Sow[acc];accumF[...])) on the r.h.s. would lead to a non-tail-recursive accumF (See this answer for a more detailed discussion of this issue. This is also why I make the (Sow[acc]; {}) line inside a function call - to avoid the top-level CompoundExpression on the r.h.s.). Another subtle point is that I have to maintain a copy of the linked list containing the remaining elements right after the last successful match was found, since in the case of unsuccessful sequence I need to roll back to that list minus its first element, and start over. This linked list is stored in the first argument of accumF.

Note that passing large linked lists does not cost much, since what is copied is only a first element (head) and a pointer to the rest (tail). This is the main reason why using linked lists vastly improves performance, as compared to the case of patterns like {___,x__,right___} - because in the latter case, a full sequences x or right are copied. With linked lists, we effectively only copy a few references, and therefore our algorithms behave roughly as we expect (linearly in the length of the data list here). In this answer, I also mentioned the use of linked lists in such cases as one of the techniques to optimize code (section 3.4).

Compile - based solution

Here is a straightforward but not too elegant function based on Compile, which finds a list of starting and ending bands positions in the list:

bandPositions = 
  Compile[{{lst, _Real, 1}, {runLength, _Integer}, {band, _Real}},
   Module[{i = 1, j, currentMin, currentMax, 
        startEndPos = Table[{0, 0}, {Length[lst]}], ctr = 0},
    For[i = 1, i <= Length[lst], i++,
      currentMin = currentMax = lst[[i]];
      For[j = i + 1, j <= Length[lst], j++,
        If[lst[[j]] < currentMin,
           currentMin = lst[[j]],
           (* else *)
           If[lst[[j]] > currentMax,
             currentMax = lst[[j]]
           ]
        ];
        If[Abs[currentMax - currentMin] >= band ,
          If[ j - i >= runLength,
             startEndPos[[++ctr]] = {i, j - 1}; i = j - 1
          ];
          Break[],
          (* else *)
          If[j == Length[lst] && j - i >= runLength - 1,
              startEndPos[[++ctr]] = {i, j}; i = Length[lst];
              Break[];
          ];
        ]
      ]; (* inner For *)
    ]; (* outer For *)
    Take[startEndPos, ctr]], CompilationTarget -> "C"];

This is used in the final function:

getBandsC[lst_List, runLength_Integer, band_?NumericQ] :=
   Map[Take[lst, #] &, bandPositions[lst, runLength, band]]

Examples of use:

In[305]:= getBandsC[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,1.4,-0.3,-0.1,-0.7},3,1]
Out[305]= {{-0.6,-0.8,-0.1},{-0.3,-0.1,-0.7}}

In[306]:= getBandsC[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7},3,1]
Out[306]= {{-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7}}

Benchmarks

In[381]:= 
largeTest  = RandomReal[{-5,5},50000];
(res1 =getBandsLL[largeTest,3,1]);//Timing
(res2 =getBandsC[largeTest,3,1]);//Timing
res1==res2

Out[382]= {1.109,Null}
Out[383]= {0.016,Null}
Out[384]= True

Obviously, if one wants performance, Compile wins hands down. My observations for larger lists are that both solutions have approximately linear complexity with the size of the number list (as they should), with compiled one roughly 150 times faster on my machine than the one based on linked lists.

Remarks

In fact, both methods encode the same algorithm, although this may not be obvious. The one with recursion and patterns is arguably somewhat more understandable, but that is a matter of opinion.

A simple, but slow and buggy version

Here is the original code that I wrote first to solve this problem. This is based on a rather straightforward use of patterns and repeated rule application. As mentioned, one disadvantage of this method is its very bad performance. This is actually another case against using constructs like {___,x__,y___} in conjunction with repeated rule application, for anything longer than a few dozens elements. In the mentioned recommendations for code optimization techniques, this corresponds to the section 4.1.

Anyways, here is the code:

If[# === {}, #, First@#] &@
 Reap[thisList //. {
    left___, 
    Longest[x__] /;Length[{x}] >= runLength && Abs[Max[{x}] - Min[{x}]] < band,
    right___} :> (Sow[{x}]; {right})][[2]]

It works correctly for both of the original small test lists. It also looks generally correct, but for larger lists it often misses some bands, which can be seen by comparison with the other two methods. I wasn't so far able to localize the bug, since the code seems pretty transparent.

like image 187
Leonid Shifrin Avatar answered Oct 31 '22 11:10

Leonid Shifrin


I'd try this instead:

thisList /. {___, Longest[a : Repeated[_, {3, Infinity}]], ___} :> 
               {a} /; Abs[Max@{a} - Min@{a}] < 1

where Repeated[_, {3, Infinity}] guarantees that you get at least 3 terms, and Longest ensures it gives you the longest run possible. As a function,

Clear[f]
f[list_List, band_, minlen_Integer?Positive] := f[list, band, minlen, Infinity]
f[list_List, band_, 
  minlen_Integer?Positive, maxlen_?Positive] /; maxlen >= minlen := 
 list /. {___, Longest[a : Repeated[_, {minlen, maxlen}]], ___} :> {a} /; 
    Abs[Max@{a} - Min@{a}] < band
like image 26
rcollyer Avatar answered Oct 31 '22 13:10

rcollyer