You know the functionality in Excel when you type 3 rows with a certain pattern and drag the column all the way down Excel tries to continue the pattern for you.
For example
Type...
Excel will continue it with:
Same works for some other patterns such as dates and so on.
I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:
Now based on this entries I want say that the pattern is:
Continue the [DYNAMIC] with other colours is whole another deal, I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this from a large of pool entries. Assume that you got 10.000 strings with this kind of patterns, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.
Such as:
I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.
We all know and love the Auto Fill feature in Excel. Microsoft went a step further in Excel 2013 and created Flash Fill. This new feature recognizes patterns in your data and will finish tedious tasks for you.
Excel allows you to easily change the background pattern used in the cell. In the early days of spreadsheets, patterns were the only way you had to differentiate one cell from another. To change cell patterns, follow these steps: Select the cells whose background patterns you want to change.
As soon as you start considering finding dynamic parts of patterns of the form : <const1><dynamic1><const2><dynamic2>....
without any other assumptions then you would need to find the longest common subsequence of the sample strings you have provided. For example if I have test-123-abc
and test-48953-defg
then the LCS would be test-
and -
. The dynamic parts would then be the gaps between the result of the LCS. You could then look up your dynamic part in an appropriate data structure.
The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With