Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to reduce sequences in an array of strings

Tags:

c#

.net

algorithm

Please, now that I've re-written the question, and before it suffers from further fast-gun answers or premature closure by eager editors let me point out that this is not a duplicate of this question. I know how to remove duplicates from an array.

This question is about removing sequences from an array, not duplicates in the strict sense.

Consider this sequence of elements in an array;

[0] a
[1] a
[2] b
[3] c
[4] c
[5] a
[6] c
[7] d
[8] c
[9] d

In this example I want to obtain the following...

[0] a
[1] b
[2] c
[3] a
[4] c
[5] d

Notice that duplicate elements are retained but that sequences of the same element have been reduced to a single instance of that element.

Further, notice that when two lines repeat they should be reduced to one set (of two lines).

[0] c
[1] d
[2] c
[3] d

...reduces to...

[0] c
[1] d

I'm coding in C# but algorithms in any language appreciated.

like image 540
Ed Guiness Avatar asked Sep 11 '08 16:09

Ed Guiness


1 Answers

EDIT: made some changes and new suggestions

What about a sliding window...

REMOVE LENGTH 2: (no other length has other matches)
//the lower case letters are the matches
ABCBAbabaBBCbcbcbVbvBCbcbcAB  
__ABCBABABABBCBCBCBVBVBCBCBCAB

REMOVE LENGTH 1 (duplicate characters):
//* denote that a string was removed to prevent continual contraction
//of the string, unless this is what you want.
ABCBA*BbC*V*BC*AB
_ABCBA*BBC*V*BC*AB

RESULT:
ABCBA*B*C*V*BC*AB == ABCBABCVBCAB

This is of course starting with length=2, increase it to L/2 and iterate down.

I'm also thinking of two other approaches:

  1. digraph - Set a stateful digraph with the data and iterate over it with the string, if a cycle is found you'll have a duplication. I'm not sure how easy it is check check for these cycles... possibly some dynamic programming, so it could be equivlent to method 2 below. I'm going to have to think about this one as well longer.
  2. distance matrix - using a levenstein distance matrix you might be able to detect duplication from diagonal movement (off the diagonal) with cost 0. This could indicate duplication of data. I will have to think about this more.
like image 154
nlucaroni Avatar answered Sep 29 '22 01:09

nlucaroni