Array (3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32)
the frequent sequence of numbers will be (3, 5) f=2 + (4, 7, 13) f=2
any Algorithm or Pseudo code to find that ?
Update(1):
if (7, 13)
also occurrence it will be included in the longest one by update its frequency so
(4, 7, 13) f=3
and so on...
Update(2):
in case of (1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2)
the output should be (1,2,3,4)
& (3,4,1,2)
& (7,8)
, to make it clear consider each number as a word and you want to find most frequent phrases
so it is common to see same word(s) in a lot of phrases but if any phrase was sub-string for any other
phrase(s) should not be consider as a phrase but will update frequency of each phrase includes it
** EDIT ** : slightly better implementation, now also returns frequences and has a better sequence filter.
function getFrequences($input, $minimalSequenceSize = 2) {
$sequences = array();
$frequences = array();
$len = count($input);
for ($i=0; $i<$len; $i++) {
$offset = $i;
for ($j=$i+$minimalSequenceSize; $j<$len; $j++) {
if ($input[$offset] == $input[$j]) {
$sequenceSize = 1;
$sequence = array($input[$offset]);
while (($offset + $sequenceSize < $j)
&& ($input[$offset+$sequenceSize] == $input[$j+$sequenceSize])) {
if (false !== ($seqIndex = array_search($sequence, $frequences))) {
// we already have this sequence, since we found a bigger one, remove the old one
array_splice($sequences, $seqIndex, 1);
array_splice($frequences, $seqIndex, 1);
}
$sequence[] = $input[$offset+$sequenceSize];
$sequenceSize++;
}
if ($sequenceSize >= $minimalSequenceSize) {
if (false !== ($seqIndex = array_search($sequence, $sequences))) {
$frequences[$seqIndex]++;
} else {
$sequences[] = $sequence;
$frequences[] = 2; // we have two occurances already
}
// $i += $sequenceSize; // move $i so we don't reuse the same sub-sequence
break;
}
}
}
}
// remove sequences that are sub-sequence of another frequence
// ** comment this to keep all sequences regardless **
$len = count($sequences);
for ($i=0; $i<$len; $i++) {
$freq_i = $sequences[$i];
for ($j=$i+1; $j<$len; $j++) {
$freq_j = $sequences[$j];
$freq_inter = array_intersect($freq_i, $freq_j);
if (count($freq_inter) != 0) {
$len--;
if (count($freq_i) > count($freq_j)) {
array_splice($sequences, $j, 1);
array_splice($frequences, $j, 1);
$j--;
} else {
array_splice($sequences, $i, 1);
array_splice($frequences, $i, 1);
$i--;
break;
}
}
}
}
return array($sequences, $frequences);
};
Test case
header('Content-type: text/plain');
$input = array(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 3, 5, 65, 4, 7, 13, 32, 5, 48, 4, 7, 13);
list($sequences, $frequences) = getFrequences($input);
foreach ($sequences as $i => $s) {
echo "(" . implode(',', $s) . ') f=' . $frequences[$i] . "\n";
}
** EDIT ** : here's an update to the function. It was almost completely rewritten... tell me if this is what you were looking for. I also added a redundancy check to prevent counting the same sequence, or subsequence, twice.
function getFrequences2($input, $minSequenceSize = 2) {
$sequences = array();
$last_offset = 0;
$last_offset_len = 0;
$len = count($input);
for ($i=0; $i<$len; $i++) {
for ($j=$i+$minSequenceSize; $j<$len; $j++) {
if ($input[$i] == $input[$j]) {
$offset = 1;
$sub = array($input[$i]);
while ($i + $offset < $j && $j + $offset < $len) {
if ($input[$i + $offset] == $input[$j + $offset]) {
array_push($sub, $input[$i + $offset]);
} else {
break;
}
$offset++;
}
$sub_len = count($sub);
if ($sub_len >= $minSequenceSize) {
// $sub must contain more elements than the last sequence found
// otherwise we will count the same sequence twice
if ($last_offset + $last_offset_len >= $i + $sub_len) {
// we already saw this sequence... ignore
continue;
} else {
// save offset and sub_len for future check
$last_offset = $i;
$last_offset_len = $sub_len;
}
foreach ($sequences as & $sequence) {
$sequence_len = count($sequence['values']);
if ($sequence_len == $sub_len && $sequence['values'] == $sub) {
//echo "Found add-full ".var_export($sub, true)." at $i and $j...\n";
$sequence['frequence']++;
break 2;
} else {
if ($sequence_len > $sub_len) {
$end = $sequence_len - $sub_len;
$values = $sequence['values'];
$slice_len = $sub_len;
$test = $sub;
} else {
$end = $sub_len - $sequence_len;
$values = $sub;
$slice_len = $sequence_len;
$test = $sequence['values'];
}
for ($k=0; $k<=$end; $k++) {
if (array_slice($values, $k, $slice_len) == $test) {
//echo "Found add-part ".implode(',',$sub)." which is part of ".implode(',',$values)." at $i and $j...\n";
$sequence['values'] = $values;
$sequence['frequence']++;
break 3;
}
}
}
}
//echo "Found new ".implode(',',$sub)." at $i and $j...\n";
array_push($sequences, array('values' => $sub, 'frequence' => 2));
break;
}
}
}
}
return $sequences;
};
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With