 

Using Indextank for a site search

I'm looking for free, easy-to-implement, ad-free alternatives to Google CSE.

I found IndexTank, which looks like an easy enough way to index content, but it doesn't crawl your site. I had envisaged being able to just pass it a URL, à la Google CSE.

So, is there an easy way I could set up a PHP script to do the crawling part? That is, pass it a URL and have it index all the web pages on that domain.

The end result would be that I can put a site search on my website.


1 Answer

I implemented this functionality on my site. Basically, I have an HTML form where the user can type a query:

<form method="post" action="[_LINK_HELP_SEARCH_]">
  <div class="static-text">(_INTRO_)</div>
  <input class="inline" name="q" id="search" type="text" value="[_QUERY_]" />
  <input class="inline" type="submit" value="(_SEARCH_)" />
  <div class="micro-text">(_EXAMPLE_)</div>
</form>

Note: all the [_XXX_] and (_YYY_) placeholders are template fields; substitute them with your own values in your code.
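The answer jumps straight to a $query variable, so here is a minimal sketch of the receiving side (not part of the original answer), assuming the form posts to a plain PHP handler and using the q field from the form above. Trimming and lowercasing here keeps the query consistent with the lowercased page text used further down:

// Hypothetical receiving end of the form above: read the posted "q" field,
// trim it and lowercase it so it matches the lowercased file text later on.
$query = isset($_POST['q']) ? strtolower(trim($_POST['q'])) : '';

if ($query === '') {
    // Nothing to search for; just render the form again.
    return;
}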

When the form is submitted, a PHP script splits the query into words:

$query = preg_replace('/\s{2,}/', ' ', $query);  // collapse runs of whitespace
$words = explode(' ', $query);                   // one entry per search word

Then it searches every file in the target folder:

$help_files = _get_all_files('help');
$help_files = array_slice($help_files, 0, MAX_RESULTS);
foreach($help_files as $file) {

Note that I search only in the 'help' folder; you should adapt this to your own needs. Note also that _get_all_files is a custom function that just lists all the PHP files in the given folder.
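_get_all_files is not shown in the answer; a rough stand-in, assuming the help pages are plain .php files in a single folder and reusing the filename as a title, could look like this:

// Hypothetical stand-in for the author's _get_all_files(): list the PHP files
// in a folder and return page/title pairs. The real helper probably knows each
// page's real title; here the filename is reused as a placeholder.
function _get_all_files($folder) {
    $files = array();
    foreach (glob($folder . '/*.php') as $path) {
        $files[] = array(
            'page'  => $path,                    // used later as the result link
            'title' => basename($path, '.php'),  // placeholder title
        );
    }
    return $files;
}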

Then load and normalize the text of each file:

$text_file = '';
$filename = $file['page'];
if (_file_exists($filename)) {
    $text_file = _read_php_file($filename);
}

$text_file = strtolower($text_file);                            // match case-insensitively
$text_file = strip_tags($text_file);                            // drop HTML markup
$text_file = preg_replace('/\[_(.*?)_\]/', '...', $text_file);  // blank out template fields
$text_file = preg_replace(array('/\s{2,}/', '/[\t\n]/'), ' ', $text_file);  // collapse whitespace

Note here that _read_php_file renders the PHP content file, i.e. it returns exactly the same output a user would get when requesting that page. This is because I use templates, so my HTML files are not served directly. If you use static HTML, you can use file_get_contents() or similar.
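_read_php_file is another custom helper. One common way to capture a PHP template's rendered output is output buffering, so a sketch (an assumption, not necessarily the author's implementation) could be:

// Hypothetical version of _read_php_file(): render the PHP page and return
// the generated HTML as a string, using output buffering.
function _read_php_file($filename) {
    ob_start();
    include $filename;       // run the template; it echoes its HTML
    return ob_get_clean();   // return the buffered output as a string
}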

Next, search for the query words:

$score = 0;
foreach ($words as $word) {
    if (strpos($text_file, $word) !== false) {
        $score++;
    }
}

I know this could be optimized, but that wasn't necessary for the moment. Basically, this piece of code counts how many of the query words appear in the text and uses that count as the score.
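If you want a slightly finer ranking, one small variation (not in the original answer) is to weight repeated occurrences with substr_count(), so pages that mention a term many times rank above pages that mention it once:

// Alternative scoring: count every occurrence of each word instead of a
// simple found/not-found check.
$score = 0;
foreach ($words as $word) {
    $score += substr_count($text_file, $word);
}

If you use this variant, adapt the significance check further down, since the score can then exceed the number of query words.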

Next you may be interested in creating a text excerpt:

$pos = strpos($text_file, $words[0]);             // position of the first query word
$cut_ini = max($pos - RESUME_LIMIT/2, 0);         // start the excerpt a bit before it
$extract = substr($text_file, $cut_ini, RESUME_LIMIT);
$extract = "...$extract...";

And last, I store all this info in the output array (one entry per file found), if the score is significant:

// keep only pages where at least ~70% of the query words were found
if (($score > 0) && ($score / count($words) > 0.7)) {
    $result = array (
        'extract'   => $extract,
        'title'     => $file['title'],
        'link'      => $file['page'],
        'score'     => $score
    );
    $results[] = $result;
}

Of course, all of this must be repeated for each file you want to index, and at the end you must sort the array:

usort($results, "_search_sort");

With this function:

function _search_sort($a, $b) {
    if ($a['score'] == $b['score']) {
        return 0;
    }
    return ($a['score'] > $b['score']) ? -1 : 1;
}

At the end you will have a sorted array with your search results. I hope this helps.
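To display the results, a simple loop over the sorted array might look like this (the markup below is a placeholder, not the author's template):

// Sketch of rendering the sorted results as a list of links with excerpts.
foreach ($results as $result) {
    echo '<div class="search-result">';
    echo '<a href="' . htmlspecialchars($result['link']) . '">'
       . htmlspecialchars($result['title']) . '</a>';
    echo '<p>' . htmlspecialchars($result['extract']) . '</p>';
    echo '</div>';
}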
