Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dynamic Regex generation for predictable repeating string patterns in a data feed

I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.

The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.

e.g

Panasonic TV with FREE Blu-Ray Player

Sony TV with FREE DVD Player + Box Office DVD

Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher

So the format in this instance is: PRODUCT with FREEGIFT.

PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.

Creating a Regular Expression to extract the dynamic parts is trivial.

The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:

Brand new Panasonic TV and a FREE Blu-Ray Player if you order today

Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today

Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today

And next week it will probably be something different, so I have to keep modifying my Regular Expressions...

How would you handle this?

Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programatically create a dynamic Regular Expression?

Thanks for any help or advice.

like image 288
waveydavey Avatar asked Oct 23 '22 11:10

waveydavey


1 Answers

This code isn't perfect, it certainly isn't efficient, and it's very likely to be too late to help you, but it does work. If given a set of strings, it will return the common content above a certain length.

However, as others have mentioned, an algorithm can only give you an approximation, as you could hit a bad batch where all products have the same initial word, and then the code would accidentally identify that content as static. It may also produce mismatches when dynamic content shares values with static content, but as the size of samples you feed into it grows, the chance of error will shrink.

I'd recommend running this on a subset of your data (20000 rows would be a bad idea!) with some sort of extra sanity checking (max # of static elements etc)

Final caveat: it may do a perfect job, but even if it does, how do you know which item is the PRODUCT and which one is the FREEGIFT?

The algorithm

  1. If all strings in the set start with the same character, add that character to the "current match" set, then remove the leading character from all strings
  2. If not, remove the first character from all strings whose first x (minimum match length) characters aren't contained in all the other strings
  3. As soon as a mismatch is reached (case 2), yield the current match if it meets the length requirement
  4. Continue until all strings are exhausted

The implementation

private static IEnumerable<string> FindCommonContent(string[] strings, int minimumMatchLength)
{
    string sharedContent = "";

    while (strings.All(x => x.Length > 0))
    {
        var item1FirstCharacter = strings[0][0];

        if (strings.All(x => x[0] == item1FirstCharacter))
        {
            sharedContent += item1FirstCharacter;

            for (int index = 0; index < strings.Length; index++)
                strings[index] = strings[index].Substring(1);

            continue;
        }

        if (sharedContent.Length >= minimumMatchLength)
            yield return sharedContent;

        sharedContent = "";

        // If the first minMatch characters of a string aren't in all the other strings, consume the first character of that string
        for (int index = 0; index < strings.Length; index++)
        {
            string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));

            if (!strings.All(x => x.Contains(testBlock)))
                strings[index] = strings[index].Substring(1);
        }
    }

    if (sharedContent.Length >= minimumMatchLength)
        yield return sharedContent;
}

Output

Set 1 (from your example):

FindCommonContent(strings, 4);
=> "with "

Set 2 (from your example):

FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"

Building the regex

This should be as simple as:

 "{.*}" + string.Join("{.*}", FindCommonContent(strings, 4)) + "{.*}";
=> "^{.*}Brand new {.*}and a {.*}if you order today{.*}$"

Although you could modify the algorithm to return information about where the matches are (between or outside the static content), this will be fine, as you know some will match zero-length strings anyway.

like image 159
Simon MᶜKenzie Avatar answered Oct 27 '22 09:10

Simon MᶜKenzie