
Compare text to template to detect anomalies (reverse template)

I'm looking for an algorithm, or even a family of algorithms, that deals with the problem of validating that a short text (an email) matches known templates. Coding will probably be in Python or Perl, but that's flexible.

Here's the problem:

Servers with access to production data need to be able to send out email that will reach the Internet:

Dear John Smith,

We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
      $12.34 Spuznitz, LLC on 4/1
      $43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal. 
Thank you for your business!

Obviously some of the email contents will vary - the salutation ("John Smith"), "$123.45 on 2/4/13", and the lines with transactions printed out. Other parts ("We received your last payment") are very static. I want to be able to match the static portions of the text and verify that the dynamic portions stay within certain reasonable limits (I might know that the maximum number of transaction lines printed is 5, for example).

Because I am concerned about data exfiltration, I want to make sure email that doesn't match this template never goes out - I want to examine email and quarantine anything that doesn't look like what I expect. So I need to automate this template matching and block any email messages that are far enough away from matching.

So the question is, where do I look for a filtering mechanism? Bayesian filtering tries to verify sufficient similarity between a specific message and a non-specific corpus, which is more or less the opposite problem. Things like Perl's Template module are a close fit - but for generating output, not for parsing or comparing input. Simple 'diff'-style comparisons won't handle the limited dynamic content very well.

How do I test to see if these outgoing email messages "quack like a duck"?

asked Feb 05 '13 by gowenfawr

2 Answers

You could use grammars for tight matching. It is possible to organize regexps into grammars for easier abstraction: http://www.effectiveperlprogramming.com/blog/1479

Or you could use a dedicated grammar engine such as Marpa.
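
Whichever engine you pick, the core idea is composing small, named patterns into one pattern for the whole message. As a rough illustration only (not Marpa, and not the Perl technique from the linked article), a Python sketch with made-up field patterns and an assumed five-transaction limit might look like:

import re

# Hypothetical "terminals" of the grammar; tune these to your own data.
RULES = {
    "name":     r"[A-Z][a-z]+(?: [A-Z][a-z]+)*",
    "currency": r"\$\d{1,6}\.\d{2}",
    "date":     r"\d{1,2}/\d{1,2}(?:/\d{2,4})?",
}

# Compose the terminals into one pattern for the whole message,
# allowing at most five transaction lines (an assumed limit).
MESSAGE = re.compile(
    (
        r"Dear {name},\s+"
        r"We received your last payment for {currency} on {date}\. "
        r"We'd like you to be aware of the following charges:\s+"
        r"(?:{currency} .{{1,40}} on {date}\s+){{1,5}}"
        r"As always, you can view these transactions in our portal\.\s+"
        r"Thank you for your business!\s*"
    ).format(**RULES)
)

def looks_like_template(text):
    """True only if the entire message matches the composed grammar."""
    return MESSAGE.fullmatch(text) is not None

A message generated from the real template matches end to end; extra or altered lines make fullmatch fail, so the message can be quarantined.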

If you want a more statistical approach, consider n-grams. First, tokenize the text and replace variable chunks with meaningful placeholders, like CURRENCY and DATE. Then build the n-grams. Now you can use the Jaccard index to compare two texts.

Here is a pure-Perl implementation that works on trigrams:

#!/usr/bin/env perl
use strict;
use utf8;
use warnings;

my $ngram1 = ngram(3, tokenize(<<'TEXT1'));
Dear John Smith,

We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
      $12.34 Spuznitz, LLC on 4/1
      $43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal. 
Thank you for your business!
TEXT1

my $ngram2 = ngram(3, tokenize(<<'TEXT2'));
Dear Sally Bates,

We received your last payment for $456.78 on 6/9/12. We'd like you to be aware of the following charges:
      $123,43 Gnomovision on 10/1
As always, you can view these transactions in our portal. 
Thank you for your business!
TEXT2

# Jaccard index on the trigram sets ($ngram->[2] holds the trigrams).
my %intersection =
    map { exists $ngram1->[2]{$_} ? ($_ => 1) : () }
    keys %{$ngram2->[2]};
my %union =
    map { $_ => 1 }
    keys %{$ngram1->[2]}, keys %{$ngram2->[2]};

printf "Jaccard similarity coefficient: %0.3f\n", keys(%intersection) / keys(%union);

sub tokenize {
    my @words = split m{\s+}x, lc shift;

    # Replace variable chunks with placeholders so they compare equal across emails.
    for (@words) {
        s{\d{1,2}/\d{1,2}(?:/\d{2,4})?}{ DATE }gx;
        s{\d+(?:\,\d{3})*\.\d{1,2}}{ FLOAT }gx;
        s{\d+}{ INTEGER }gx;
        s{\$\s(?:FLOAT|INTEGER)\s}{ CURRENCY }gx;
        s{^\W+|\W+$}{}gx;   # strip leading/trailing punctuation
    }

    return @words;
}

sub ngram {
    my ($size, @words) = @_;
    --$size;

    # Build every n-gram of 1 word up to the requested size;
    # $ngram->[$n - 1]{$gram} counts the occurrences of each $n-word gram.
    my $ngram = [];
    for (my $j = 0; $j <= $#words; $j++) {
        my $k = $j + $size <= $#words ? $j + $size : $#words;
        for (my $l = $j; $l <= $k; $l++) {
            my @buf = @words[$j..$l];
            ++$ngram->[$#buf]{join(' ', @buf)};
        }
    }

    return $ngram;
}

You can use one text as a template and match it against your emails. Check String::Trigram for an efficient implementation. The Google Ngram Viewer is a nice resource for illustrating n-gram matching.
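
If you would rather prototype in Python, here is a minimal sketch of the same tokenize / trigram / Jaccard approach (the placeholder regexes, sample strings, and any cutoff threshold are assumptions to tune against your real templates):

import re

def tokenize(text):
    """Lowercase, split on whitespace, and swap variable chunks for placeholders."""
    words = []
    for w in text.lower().split():
        w = re.sub(r"\d{1,2}/\d{1,2}(?:/\d{2,4})?", "DATE", w)
        w = re.sub(r"\$\d+(?:,\d{3})*\.\d{1,2}", "CURRENCY", w)
        w = re.sub(r"\d+", "INTEGER", w)
        w = w.strip(".,!:")
        if w:
            words.append(w)
    return words

def trigrams(words):
    """Return the set of word trigrams."""
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

template = "We received your last payment for $123.45 on 2/4/13."
outgoing = "We received your last payment for $456.78 on 6/9/12."

score = jaccard(trigrams(tokenize(template)), trigrams(tokenize(outgoing)))
print("Jaccard similarity: %.3f" % score)  # 1.000 here: only the placeholders differ
# In practice, quarantine any message whose score falls below a threshold you tune.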

answered Sep 28 '22 by creaktive


If you want to match a pre-existing template containing control-flow elements like {% for x in y %} against its supposed output, you're going to have to parse the template language - which seems like a lot of work.

On the other hand, if you're prepared to write a second template for validation purposes - something like:

Dear {{customer}},

We received your last payment for {{currency}} on {{full-date}}\. We'd like you to be aware of the following charges:
(      {{currency}} {{supplier}} on {{short-date}}
){,5}As always, you can view these transactions in our portal\. 

... which is just a simple extension of regex syntax, it's pretty straightforward to hack something together that will validate against that:

import re

# Regex fragments that each {{tag}} in the validation template expands to.
FIELDS = {
    "customer": r"[\w\s\.-]{,50}",
    "supplier": r"[\w\s\.,-]{,30}",
    "currency": r"[$€£]\d+\.\d{2}",
    "short-date": r"\d{,2}/\d{,2}",
    "full-date": r"\d{,2}/\d{,2}/\d{2}",
}

def validate(example, template_file):
    """Return True if `example` matches the validation template in `template_file`."""
    with open(template_file) as f:
        template = f.read()
    # Substitute each placeholder with its field pattern, then anchor at the end.
    for tag, pattern in FIELDS.items():
        template = template.replace("{{%s}}" % tag, pattern)
    valid = re.compile(template + "$")
    return valid.match(example) is not None

The example above isn't the greatest Python code of all time by any means, but it's enough to get the general idea across.
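
For example, assuming the validation template above is saved as statement.tmpl (a hypothetical filename), the outgoing-mail check could be as small as:

outgoing = """Dear John Smith,

We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
      $12.34 Spuznitz, LLC on 4/1
      $43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal. 
"""

# "statement.tmpl" is a hypothetical path containing the validation template shown above.
if validate(outgoing, "statement.tmpl"):
    print("matches the template - OK to send")
else:
    print("does not match - quarantine for review")

Note that the template shown stops at the "As always" line, so the sample message here does too; in practice you would extend the template to cover the full message body and feed in the real outgoing text from your mail gateway.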

answered Sep 28 '22 by Zero Piraeus