Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there an equivalent of StringPosition[] for searching lists? If not, what's the fastest way to implement this?

Is there a function that searches a sequence of elements for a subsequence? I am looking for an analogue of StringPosition for Lists. In my current application I am working with integer lists, but I'd be interested in a general FindSequence[list, pattern, n] function which will find the first n occurrences of pattern in list.


Here's a toy example:

Generate some data:

In[1]:= $HistoryLength = 0    
Out[1]= 0

In[2]:= Timing[digits = First[RealDigits[\[Pi], 2, 10000000]];]    
Out[2]= {26.5, Null}

Let's convert it to a string so we can compare to StringPosition. This is very slow an memory hungry. (The memory is freed when the evaluation finishes.)

In[3]:= Timing[str = StringJoin[ToString /@ digits];]    
Out[3]= {43.813, Null}

I am looking for this subsequence:

In[4]:= patt = {1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 
   1, 0, 1, 1};

In[5]:= strpatt = StringJoin[ToString /@ patt];

Searching the string is very fast:

In[6]:= StringPosition[str, strpatt] // Timing    
Out[6]= {1.047, {{5737922, 5737943}}}

This is a simple implementation of searching for numerical arrays. It's slower than StringPosition:

In[7]:= Timing[
           corr = ListCorrelate[patt, digits];
           Select[Flatten@Position[corr, patt.patt], 
             digits[[# ;; # + Length[patt] - 1]] === patt &]
        ]

Out[7]= {2.234, {5737922}}

Summary:

  1. Is there a builtin that searches lists for subsequences?
  2. If there isn't, what is a fast and elegant implementation for numeric lists (my practical problem)?
  3. What about generic lists that can contain anything? (There are two possibilities here: "static" patterns only such as {1,0,1}, or general ones like {1,_,1}, though these latter ones may introduce complications.)

I expect this will have many solutions, some fast, some more elegant, some more general :-)


Related questions:

  • A fast implementation in Mathematica for Position2D (2D case of the same thing)
  • What is the best way to find the period of a (repeating) list in Mathematica?

Interesting reading:

  • Boyer-Moore string search algorithm

EDIT:

I just found the undocumented LongestCommonSubsequencePositions. LongestCommonSubsequencePositions[a, b] will find the longest common subsequence of the lists a and b, and return position of its first occurrence only in both a and b. (The documented LongestCommonSubsequence, which I was not aware of, will only return the subsequence itself, not its position.)

It is slower than the alternatives above, but it works on general lists that can contain any expression.

In[57]:= LongestCommonSubsequencePositions[digits, patt] // Timing
Out[57]= {5.25, {{5737922, 5737943}, {1, 22}}}
like image 638
Szabolcs Avatar asked Jan 05 '12 09:01

Szabolcs


2 Answers

You can use ReplaceList with a "prefix" and "suffix" of ___ and match the whole list. This gives you all the replacements that can be made (as opposed to Replace). The position of your pattern is then simply the length of the prefix + 1. It's pretty fast as well:

In[40]:= Timing[ReplaceList[digits, Join[{pre___}, patt, {___}] :> Length[{pre}]
   + 1]]

Out[40]= {1.3059, {5737922}}

Edit: figured it's slightly more elegant to use a delayed rule than to map Length afterwards.

like image 113
Jan Pöschko Avatar answered Sep 22 '22 14:09

Jan Pöschko


Please have a look at functions seqPos (general lists) and seqposC (integer lists, compiled), which do exactly what you ask for, and are fast. I used them in this answer (for the question you actually linked to).

Here are the timing results for various solutions:

In[15]:= seqPos[digits, patt] // Timing
Out[15]= {1.297, {5737922}}

In[16]:= seqposC[digits, patt] // Timing
Out[16]= {0.125, {5737922}}

In[17]:= 
Timing[corr = ListCorrelate[patt, digits];
      Select[Flatten@Position[corr, patt.patt], 
         digits[[# ;; # + Length[patt] - 1]] === patt &]]

Out[17]= {0.844, {5737922}}

In[18]:= Timing[
    ReplaceList[digits, Join[{pre__}, patt, {___}] :> Length[{pre}] + 1]]

Out[18]= {0.953, {5737922}}

In[19]:= AbsoluteTiming[cf[digits, patt]]
Out[19]= {3.1914063, 5737922}

These indicate that your approach with ListCorrelate is not bad at all. My first function seqPos (it is actually due to Norbert Pozar) is a bit slower but then it is completely general, while seqposC is much faster.

like image 38
Leonid Shifrin Avatar answered Sep 21 '22 14:09

Leonid Shifrin