Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

regex numeric data processing: match a series of numbers greater than X

Tags:

python

regex

Say I have some data like this:

number_stream = [0,0,0,7,8,0,0,2,5,6,10,11,10,13,5,0,1,0,...]

I want to process it looking for "bumps" that meet a certain pattern.

Imagine I have my own customized regex language for working on numbers, where [[ >=5 ]] represents any number >= 5. I want to capture this case:

([[ >=5 ]]{3,})[[ <3 ]]{2,}

In other words, I want to begin capturing any time I look ahead and see 3 or more values >= 5 in a row, and stop capturing any time I look ahead and see 2+ values < 3. So my output should be:

>>> stream_processor.process(number_stream)
[[5,6,10,11,10,13,5],...]

Note that the first 7,8,... is ignored because it's not long enough, and that the capture ends before the 0,1,0....

I'd also like a stream_processor object I can incrementally pass more data into in subsequent process calls, and return captured chunks as they're completed.

I've written some code to do it, but it was hideous and state-machiney, and I can't help feeling like I'm missing something obvious. Any ideas to do this cleanly?

like image 582
Mu Mind Avatar asked May 24 '10 23:05

Mu Mind


2 Answers

State machines (enriched with quite a few extras, since regexes can match a broader range of languages than FSMs can) are a typical approach to implementing regular expression engines, so why shouldn't similar approaches emerge in looking for good implementations of your desired "regex-like" constructs?

Indeed, I would consider starting with the code for an actual RE engine (there's a Python-coded one in the PyPy sources, the mercurial tree for which is here), changing only the "primitives" (you don't need e.g. \w or \s, but you need <5, >3, etc) and keeping most syntax and implementation for *, +, and so on. Such a project would be well worth open-sourcing, by the way.

like image 154
Alex Martelli Avatar answered Nov 13 '22 16:11

Alex Martelli


A state machine is the appropriate solution here. There are two common ways to implement one:

  • Hardwired in the form of a set of mutually-recursive functions where the function that is running represents the current state. This can be very efficient but requires tail call elimination, trampolines or goto.

  • Emulated in the form of a data structure representing the current state and a function of that state and a new input datum that updates the state.

This stuff is the bread and butter of functional programming. Here is an elegant solution to your problem written in the former style in F# and run on your own data set:

> let rec skip = function
    | _, yss, [] -> yss
    | [_; _; _] as ys, yss, xs -> record ([], ys, yss, xs)
    | ys, yss, x::xs when x >= 5 -> skip (x::ys, yss, xs)
    | ys, yss, x::xs -> skip ([], yss, xs)
  and record = function
    | ys', ys, yss, [] -> (ys' @ ys) :: yss
    | [_; _], ys, yss, xs -> skip ([], ys :: yss, xs)
    | ys', ys, yss, x::xs when x < 3 -> record (x::ys', ys, yss, xs)
    | ys', ys, yss, x::xs -> record ([], x::ys' @ ys, yss, xs);;
val skip : int list * int list list * int list -> int list list
val record : int list * int list * int list list * int list -> int list list

> let run xs = skip([], [], xs) |> List.map List.rev |> List.rev;;
val run : int list -> int list list

> run [0;0;0;7;8;0;0;2;5;6;10;11;10;13;5;0;1;0];;
val it : int list list = [[5; 6; 10; 11; 10; 13; 5]]

and here is the same solution written in the latter style:

> type 'a state =
    | Skip of 'a list
    | Record of 'a list * 'a list;;
type 'a state =
  | Skip of 'a list
  | Record of 'a list * 'a list

> let rec apply (state, yss) x =
    match state, yss with
    | Skip([_; _; _] as ys), yss -> apply (Record([], ys), yss) x
    | Skip ys, yss when x >= 5 -> Skip(x::ys), yss
    | Skip ys, yss -> Skip[], yss
    | Record([_; _], ys), yss -> apply (Skip[], ys :: yss) x
    | Record(ys', ys), yss when x < 3 -> Record (x::ys', ys), yss
    | Record(ys', ys), yss -> Record ([], x::ys' @ ys), yss;;
val apply : int state * int list list -> int -> int state * int list list

> let run xs =
    match List.fold apply (Skip [], []) xs with
    | Skip _, yss -> yss
    | Record(ys', ys), yss -> (ys' @ ys) :: yss
    |> List.map List.rev |> List.rev;;
val run : int list -> int list list

> run [0;0;0;7;8;0;0;2;5;6;10;11;10;13;5;0;1;0];;
val it : int list list = [[5; 6; 10; 11; 10; 13; 5]]

Note how the first solution eats the entire input at once whereas the latter bites off a single datum at a time and returns a new state ready to eat another datum. Consequently, the latter is applied to each datum in turn using a fold and the final half-eaten state must be consumed appropriately before returning.

like image 22
J D Avatar answered Nov 13 '22 18:11

J D