Is it possible to interpolate Array values in token?

Tags:

raku

I'm working on a homoglyphs module and I have to build a regular expression that can find homoglyphed text corresponding to an ASCII equivalent.

So, for example, I have a character with no homoglyph alternatives:

my $f = 'f';

and a character that can be obfuscated:

my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron

I can easily build a regular expression that will detect the homoglyphed phrase 'foo':

say 'Suspicious!' if $text ~~ / $f @o @o /;

But how should I compose such a regular expression if I don't know the value to detect at compile time? Let's say I want to detect phishing messages that contain a homoglyphed word 'cash'. I can build a sequence with all the alternatives:

my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length

Now, obviously, the following cannot "unpack" the array elements into the regular expression:

/ @lookup / # doing LTM, not searching elements in sequence

I can work around this by manually quoting each element and composing a textual representation of the alternatives, giving a string that can be evaluated as a regular expression, and then build a token from that using string interpolation:

my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }

But that is quite error-prone. Is there a cleaner way to compose a regular expression on the fly from an arbitrary number of elements not known at compile time?

Pawel Pabian bbkr asked Jan 06 '20 23:01


2 Answers

The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
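A minimal sketch of that table-lookup idea, without regular expressions. It does not use the Unicode::Security API; the %prototype table and the skeleton and looks-like subs are hypothetical names, and the table is a tiny hand-filled stand-in for the real Unicode data:

```raku
# Hypothetical, hand-filled table: confusable character => ASCII prototype.
my %prototype =
    'о' => 'o',   # Cyrillic o
    'ο' => 'o',   # Greek omicron
    'с' => 'c',   # Cyrillic es
    'ϲ' => 'c',   # Greek lunate sigma
    'а' => 'a',   # Cyrillic a
    'α' => 'a';   # Greek alpha

# Replace every known confusable with its prototype;
# characters without a table entry pass through unchanged.
sub skeleton(Str $text --> Str) {
    $text.comb.map(-> $c { %prototype{$c} // $c }).join
}

# True when $text contains $word, either plainly or spelled with look-alikes.
sub looks-like(Str $text, Str $word --> Bool) {
    skeleton($text).contains($word)
}

say 'Suspicious!' if looks-like('send саsh now', 'cash');
```

Because the comparison happens on the normalized string, the word to detect can be an ordinary runtime value and nothing has to be interpolated into a regex.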

jjmerelo answered Nov 08 '22 15:11


I'm not sure this is the best approach to use.

I haven't implemented a confusables[1] module in Intl:: yet, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking.[2]

my token confusable($source) {
  :my $i = 0;                                    # create a counter var
  [
    <?{                                          # succeed only if
      my $a = self.orig.substr: self.pos+$i, 1;  #   the test character A
      my $b = $source.substr: $i++, 1;           #   the source character B and

      so $a eq $b                                #   are the same or
      || $a eq %*confusables{$b}.any;            #   the A is one of B's confusables
    }> 
    .                                            # because we succeeded, consume a char
  ] ** {$source.chars}                           # repeat for each grapheme in the source
}

Here I used the dynamic hash %*confusables, which would be populated in some way. How depends on your module, and it need not be dynamic at all (for example, the token could take it through the signature :($source, %confusables), or reference a module variable).

You can then have your code work as follows:

say $foo ~~ /<confusable: 'foo'>/
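As an illustration only, here is one way the hash could be filled from grouped data like the question's. The @groups data and the matches-char sub are hypothetical; the sub repeats the token's per-character test as plain code so it can be tried on its own:

```raku
# Hypothetical sample data: each group lists an ASCII character first,
# followed by its look-alikes.
my @groups = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], ['o', 'о', 'ο'];

# Map each ASCII character to the list of its confusables.
my %*confusables = @groups.map: -> @g { @g[0] => @g[1..*] };

# The same test the token performs for one character pair, as a plain sub.
# The "// ()" fallback gives an empty junction for characters with no entry,
# so a missing key never reaches the string comparison.
sub matches-char(Str $a, Str $b --> Bool) {
    so $a eq $b || $a eq (%*confusables{$b} // ()).any
}

say matches-char('с', 'c');   # Cyrillic es tested against ASCII c
say matches-char('s', 's');   # no table entry, plain equality still applies
```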

This is probably the best way to go about things, as it gives you a lot more control. I took a peek at your module and it's clear you want to enable 2-to-1 glyph relationships, and eventually you'll probably want to run code directly over the characters.

If you are okay with just 1-to-1 relationships, you can go with a much simpler token:

my token confusable($source) {
  :my @chars = $source.comb;            # split the source 
  @(                                    # match the array based on 
     |(                                 #   a slip of
        %confusables{@chars.head}       #     the confusables 
        // Empty                        #     (or nothing, if none)
     ),                                 #
     @chars.shift                       #   and the char itself
   )                                    #
   ** {$source.chars}                   # repeating for each source char
}

The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables along with the original character, and that's that. You have to be careful, though, because a non-existent hash item returns the type object (Any), which messes things up here (hence the // Empty).

In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
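A small standalone sketch of @(…) interpolation, separate from the token above. The %confusables contents and the variants sub are hypothetical names used only for this illustration:

```raku
# Hypothetical table: ASCII character => list of look-alikes.
my %confusables = 'o' => ('о', 'ο');   # Cyrillic o, Greek omicron

# The look-alikes of $c (if any) followed by $c itself.
sub variants(Str $c) {
    |(%confusables{$c} // Empty), $c
}

# @( ... ) interpolates the returned list as an alternation of literals,
# so each of the two positions may hold any variant of 'o'.
say 'Suspicious!' if 'fοo' ~~ / f @( variants('o') ) ** 2 /;
```

The list is produced by ordinary code at match time, so the set of alternatives does not have to be known when the regex is compiled.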


[1] Unicode calls homographs both "visually similar characters" and "confusables".

[2] The dynamic hash here, %confusables, could be populated in any number of ways and need not be dynamic, as it could be passed in via the arguments (using a signature like :($source, %confusables)) or be a module variable.

user0721090601 answered Nov 08 '22 17:11