Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to spot and analyse similar patterns like Excel does?

You know the functionality in Excel when you type 3 rows with a certain pattern and drag the column all the way down Excel tries to continue the pattern for you.

For example

Type...

  • test-1
  • test-2
  • test-3

Excel will continue it with:

  • test-4
  • test-5
  • test-n...

Same works for some other patterns such as dates and so on.

I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:

  • test-blue-somethingelse
  • test-yellow-somethingelse
  • test-red-somethingelse

Now based on this entries I want say that the pattern is:

  • test-[DYNAMIC]-something

Continue the [DYNAMIC] with other colours is whole another deal, I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.

I need to detect this from a large of pool entries. Assume that you got 10.000 strings with this kind of patterns, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).

Document classification can be useful in this scenario but I'm not sure where to start.

UPDATE:

I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.

Such as:

  • test_[DYNAMIC]12[DYNAMIC2]

I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.

like image 747
dr. evil Avatar asked Sep 07 '09 12:09

dr. evil


People also ask

Can Excel detect patterns?

We all know and love the Auto Fill feature in Excel. Microsoft went a step further in Excel 2013 and created Flash Fill. This new feature recognizes patterns in your data and will finish tedious tasks for you.

What is a pattern in Excel?

Excel allows you to easily change the background pattern used in the cell. In the early days of spreadsheets, patterns were the only way you had to differentiate one cell from another. To change cell patterns, follow these steps: Select the cells whose background patterns you want to change.


1 Answers

As soon as you start considering finding dynamic parts of patterns of the form : <const1><dynamic1><const2><dynamic2>.... without any other assumptions then you would need to find the longest common subsequence of the sample strings you have provided. For example if I have test-123-abc and test-48953-defg then the LCS would be test- and -. The dynamic parts would then be the gaps between the result of the LCS. You could then look up your dynamic part in an appropriate data structure.

The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.

Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.

like image 92
Il-Bhima Avatar answered Nov 11 '22 18:11

Il-Bhima