Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to generate random strings that matches a Regex in Julia?

Tags:

regex

julia

Related questions:

  • Random string that matches a regexp
  • How to generate random alphanumeric string in julia?

The question is fairly simple. I found numerous alternatives for other languages, but not in Julia:

Random Text generator based on regex

Also Random.randstring doesn't take Regex as an argument.

like image 342
Thomas Jalabert Avatar asked Nov 24 '19 19:11

Thomas Jalabert


2 Answers

It should be possible to use Automa.jl to build a DFA and randomly traverse it. Automa uses a simpler syntax than PCRE, so the languange you can describe by it should actually be regular.

I quickly threw together the following, based mostly on the code in dot.jl:

julia> function rand_re(machine::Automa.Machine)
           out = IOBuffer()
           node = machine.start
           
           while true
               if node.state ∈ machine.final_states
                   (rand() ≤ 1 / (length(node.edges) + 1)) && break
               end
               
               edge, node = rand(node.edges)
               label = rand(collect(edge.labels))
               print(out, Char(label))
           end
           
           return String(take!(out))
       end
rand_re (generic function with 1 method)

julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a6bbb"

julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a9b"

julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a3aa"

julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a1a"

julia> rand_re(Automa.compile(re"a[0-9][ab]+"))
"a5ba"

The caveat is that Automa uses byte-encoded sets for edge labels, so more care should be taken where I just write Char(label).

Since final states can still have outgoing edges, I chose to treat stopping and each edge with uniform probability. I think this will likely have the effect that potentially infinite terms will either be very short or very long; google "Boltzmann samplers" for how to solve that (not to be confused with sampling from a Boltzmann distribution!), but the solution is rather mathematically involved.

Alternatively, you could use ccall or PyCall to call rxvm_gen or Rxvm.gen of librxvm, which contains (probably) a quite performant code for non-backtracking regular expressions.

like image 104
phipsgabler Avatar answered Sep 28 '22 09:09

phipsgabler


Julia has PCRE, which means its regular expressions are far more powerful than true regular expressions. And are in-fact turing complete. I suspect there is a bunch of interesting theoretical computer science around this. I suspect your task for PCRE might be proved to be impossible because of the halting problem. But still what we can do is try a bunch of random strings and toss out those that don't match. And for simple regex that goes a long way. Its not guaranteed to give an answer though.

If one wanted stricter regex, like those covered by Automa.jl, there is probably something better that can be done, since you can walk the state machine solving it 1 bit at a time. Hopefully someone that knows Automa.jl can post their own answer.

Code

using Random: randstring

function rand_matching(regex; max_len=2^16, max_attempts=1000)
    for _ in max_attempts
        str  = randstring(max_len)
        m = match(regex, str)
        if m != nothing
            # rather than return whole string, 
            # just return the shortest bit that matches
            return m.match
        end
    end
    error("Could not find any string that matches regex")
end

demo:

julia> @time rand_matching(r"\d\d")
  0.013517 seconds (34.34 k allocations: 1.998 MiB)
"38"

julia> @time rand_matching(r"\d\d")
  0.001497 seconds (11 allocations: 128.656 KiB)
"44"

julia> @time rand_matching(r"a\d\d")
  0.000670 seconds (11 allocations: 128.656 KiB)
"a19"

julia> @time rand_matching(r"a\d\d")
  0.000775 seconds (11 allocations: 128.656 KiB)
"a83"

julia> @time rand_matching(r"a\d\db")
  0.000670 seconds (11 allocations: 128.656 KiB)
"a44b"
like image 31
Lyndon White Avatar answered Sep 28 '22 09:09

Lyndon White