I need to convert a list of search terms into the most efficient set of combined search terms. Any word or quoted phrase can be separated by an OR. Many terms can be combined within parentheses. ANDs can also be used.
For example, foo bar
and boo bar
share bar
, so instead of two different search terms, the can be combined as (foo OR boo) AND bar
.
Here's what the algorithm needs to do. Given this data set:
foo bar
boo bar
goo bar
hoo doo
foo manchu
moo bar
too bar
foo fighters
"blue kazoo" bar
baz
qux
quux
I want to get the following back:
(foo OR boo OR goo OR moo OR too OR "blue kazoo") AND bar
foo AND (manchu OR fighters)
hoo doo
baz OR qux OR quux
This does not work:
(foo bar) OR (boo bar) OR (goo bar) OR (foo manchu)
I'll be working in PHP, but I'll take the answer in pseudo-code, PHP or I'll convert from major languages.
Code for processing more than 2 values
<?php
function keyMultiSort(&$array,
$key,
$reverse = false,
$priority_last = false,
$save_key = true,
Callable $func = null)
{
if ($func === null)
{
$func = function ($first, $second) use ($key, $reverse, $priority_last)
{
if (!isset($first[$key]))
{
return ($reverse === false) ? -1 : 1;
}
if (!isset($second[$key]))
{
return ($reverse === false) ? 1 : -1;
}
if ($first[$key] > $second[$key])
{
return ($reverse === false) ? 1 : -1;
}
if ($first[$key] < $second[$key])
{
return ($reverse === false) ? -1 : 1;
}
if ($first[$key] === $second[$key])
{
return ($priority_last === false) ? 1 : -1;
}
return 0;
};
}
if ($save_key)
{
uasort($array, $func);
}
else
{
usort($array, $func);
}
}
$array = [
['foo', 'bar', 'test'],
['boo', 'bar'],
['goo', 'bar'],
['hoo', 'doo', 'test', 'test2'],
['foo', 'manchu'],
['moo', 'bar'],
['too', 'bar'],
['foo', 'fighters'],
['blue kazoo', 'bar', 'test'],
];
$pairs = [];
$str = '';
foreach($array as $item)
{
foreach($item as $key=>$elm)
{
foreach($item as $key2=>$elm2)
{
if($key !== $key2)
{
if(!isset($pairs[$elm]['count']))
{
$pairs[$elm]['count'] = 1;
}
else
{
$pairs[$elm]['count']++;
}
$pairs[$elm]['elements'][] = $elm2;
}
}
}
keyMultiSort($pairs, 'count', true);
}
//var_dump($pairs);
$remove = [];
foreach($pairs as $elm=>$item)
{
$remove[] = $elm;
$elements = array_diff($item['elements'], $remove);
if(empty($elements))
{
if (in_array($elm, $remove))
{
continue;
}
$str .= $elm.PHP_EOL;
}
else
{
$str .= $elm.' AND ('.implode(' OR ', array_unique($elements)).')'.PHP_EOL;
}
}
var_dump($str);
Response:
string(184) "bar AND (foo OR test OR boo OR goo OR moo OR too OR blue kazoo)
test AND (foo OR hoo OR doo OR test2 OR blue kazoo)
foo AND (manchu OR fighters)
hoo AND (doo OR test2)
doo AND (test2)
"
P.S. I hope I have correctly understood the task ...
UPDATE Added code which not ignores "single values". I changed logic:
...
['"yellow balloon"', 'foo', 'bar', 'baz', 'qut'],
...
return:
...
qut AND ("yellow balloon" OR baz)
baz AND ("yellow balloon")
...
It seems to me, for this task, that's correct (to conditions to combine more than 2 values).
function keyMultiSort(&$array,
$key,
$reverse = false,
$priority_last = false,
$save_key = true,
Callable $func = null)
{
if ($func === null)
{
$func = function ($first, $second) use ($key, $reverse, $priority_last)
{
if (!isset($first[$key]))
{
return ($reverse === false) ? -1 : 1;
}
if (!isset($second[$key]))
{
return ($reverse === false) ? 1 : -1;
}
if ($first[$key] > $second[$key])
{
return ($reverse === false) ? 1 : -1;
}
if ($first[$key] < $second[$key])
{
return ($reverse === false) ? -1 : 1;
}
if ($first[$key] === $second[$key])
{
return ($priority_last === false) ? 1 : -1;
}
return 0;
};
}
if ($save_key)
{
uasort($array, $func);
}
else
{
usort($array, $func);
}
}
$array = [
['foo', 'bar', 'test'],
['boo', 'bar'],
['goo', 'bar'],
['hoo', 'doo', 'test', 'test2'],
['foo', 'manchu'],
['moo', 'bar'],
['too', 'bar'],
['foo', 'fighters'],
['"blue kazoo"', 'bar', 'test'],
['"red panda"', 'bar', 'test'],
['"yellow balloon"', 'foo', 'bar', 'baz', 'qut'],
['"red panda"', 'fighters', 'moo'],
['"foo fighters"'],
['foo'],
['bar'],
];
$pairs = [];
$singles = [];
$str = '';
foreach ($array as $item)
{
foreach ($item as $key => $elm)
{
if(count($item) === 1)
{
$singles[$elm] = 1;
}
else
{
if (!isset($pairs[$elm]))
{
$pairs[$elm]['count'] = 0;
$pairs[$elm]['elements'] = [];
}
foreach ($item as $key2 => $elm2)
{
if ($key !== $key2)
{
$pairs[$elm]['count']++;
$pairs[$elm]['elements'][] = $elm2;
}
}
}
}
keyMultiSort($pairs, 'count', true);
}
//var_dump($pairs);exit;
$remove = [];
foreach ($pairs as $elm => $item)
{
$remove[] = $elm;
$elements = array_diff($item['elements'], $remove);
$elements = array_unique($elements);
if (!empty($elements)){
$str .= $elm.' AND ('.implode(' OR ', $elements).')'.PHP_EOL;
}
}
foreach ($singles as $elm => $item)
{
$str .= $elm.PHP_EOL;
}
var_dump($str);
Response:
string(421) "bar AND (foo OR test OR boo OR goo OR moo OR too OR "blue kazoo" OR "red panda" OR "yellow balloon" OR baz OR qut)
test AND (foo OR hoo OR doo OR test2 OR "blue kazoo" OR "red panda")
foo AND (manchu OR fighters OR "yellow balloon" OR baz OR qut)
"red panda" AND (fighters OR moo)
qut AND ("yellow balloon" OR baz)
baz AND ("yellow balloon")
test2 AND (hoo OR doo)
fighters AND (moo)
doo AND (hoo)
"foo fighters"
foo
bar
"
P.S. In my opinion, this problem does not apply to reality
I understand the logic but you really need to make the question clearer.
Anyway, I see this as a graph problem where we want to find the set of nodes that are have highest degree and can span the whole graph.
I believe if you picture it this way, you can use any data structure you like to serve the purpose. You could create an adjacency list and then find nodes with higher degree and then check to see if all elements are covered through those nodes. The matter of adding AND, OR is just simple afterwards.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With