I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale is simple: I want one, and I want it in PHP.
Let us make a few basic assumptions:
The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.
The result of our awesome PHP-based forward indexing algorithm should be along the lines of:
UID1 -> index.html -> helen,she,was,champion,with,freckles
UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep
UID2 -> blah.html -> next,week,on,badgerwatch
UID2 -> gah.txt -> one,one,and,one,is,not,numberwang
Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization, word-boundary disambiguation and part-of-speech tagging. Of course, I do realise this is wishful thinking, and will therefore humbly accept any worthy attempt at parsing said imaginary documents by ignoring <script> and <html> tags to compute a list of UIDs (which could be, for instance, a domain), each followed by a document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
At this stage, I also do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.
Thanks in advance, hope this was clear enough.
Redirection in PHP can be done using the header() function. To set up a simple redirect, create an index.php file in the directory you wish to redirect from with the following content: <?
The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
Forward indexing starts from 0, 1, 2, …, whereas backward indexing starts from −1, −2, −3, …, where −1 is the last element in a string, −2 is the second-to-last, and so on. Only integers can be used for indexing; otherwise, a TypeError is raised.
A forward index (or just index) is the list of documents, and which words appear in them. In the web search example, Google crawls the web, building the list of documents, figuring out which words appear in each page. The inverted index is the list of words, and the documents in which they appear.
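As a rough illustration of the relationship described above, here is a minimal PHP sketch that turns a forward index into an inverted one. The array layout (document => list of words) and the function name invertIndex are assumptions made for this example, not something taken from a particular library; the sample data is borrowed from the question.

```php
<?php
// Hypothetical forward index: document => words found in it.
$forward = [
    'UID1/index.html' => ['helen', 'she', 'was', 'champion', 'with', 'freckles'],
    'UID2/gah.txt'    => ['one', 'one', 'and', 'one', 'is', 'not', 'numberwang'],
];

/**
 * Build an inverted index (word => documents containing it)
 * from a forward index (document => words).
 */
function invertIndex(array $forward): array
{
    $inverted = [];
    foreach ($forward as $doc => $words) {
        // array_unique() so a document is listed only once per word.
        foreach (array_unique($words) as $word) {
            $inverted[$word][] = $doc;
        }
    }
    ksort($inverted, SORT_STRING); // collate by word
    return $inverted;
}

foreach (invertIndex($forward) as $word => $docs) {
    echo $word, ' -> ', implode(',', $docs), "\n";
}
```

Each output line is one word followed by the documents that contain it, which is the word-sorted view of the same pairs.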
Take a look at
http://simplehtmldom.sourceforge.net/
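You do something like
include 'simple_html_dom.php'; // the library's include file, as named in its docs
$p = file_get_html("http://www.page.com/"); // needs a full URL, including the scheme
echo $p->find("body", 0)->plaintext; // index 0 selects the first (only) <body> element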
And that will give you all the text. Want to iterate over just the links?
foreach ($p->find("a") as $link)
{
echo $link->innertext;
}
It is very useful and powerful. Check it out.
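If you would rather avoid a third-party library, PHP's bundled DOM extension can do something similar. The sketch below is only an illustration: the helper name visibleText and the sample markup are made up for this example, and it assumes the DOM extension is enabled.

```php
<?php
/**
 * Return the text of an HTML string, skipping <script> and <style> content.
 */
function visibleText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that is not inside a <script> or <style> element.
    $xpath = new DOMXPath($doc);
    $parts = [];
    foreach ($xpath->query('//text()[not(ancestor::script or ancestor::style)]') as $textNode) {
        $parts[] = $textNode->nodeValue;
    }

    // Join with spaces so words in adjacent elements do not fuse,
    // then collapse the leftover runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
}

$html = '<html><body><p>next week on</p><script>var x = 1;</script><p>badgerwatch</p></body></html>';
echo visibleText($html), "\n";
```

Joining text nodes with a space is a deliberate trade-off: it keeps words from neighbouring elements apart, at the cost of splitting a word that has inline markup in the middle of it.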
I don't think I'm totally clear on what you're trying to do, but you can get a simple result fairly easily:
<body> ... </body>.
That's really about it. You might have to add some exceptions for handling things like <script> tags (you don't want JavaScript to be treated as words that should be indexed), but that should give you a basic forward index.
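A minimal sketch of that kind of basic forward index, printing results in the format from the question. Everything here is an assumption for illustration: the function names tokenize and buildForwardIndex, the in-memory $corpus array standing in for fetched documents, and the rule that a name ending in .htm or .html is treated as HTML. The tokenizer is deliberately naive (it splits on anything that is not a letter or digit) and does no part-of-speech tagging.

```php
<?php
/**
 * Reduce one document to a list of lowercase words.
 * When $isHtml is true the markup is stripped first.
 */
function tokenize(string $content, bool $isHtml): array
{
    if ($isHtml) {
        // Drop <script>/<style> blocks together with their contents.
        $content = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $content);
        // Put a space before each tag so adjacent elements don't fuse words, then strip the tags.
        $content = strip_tags(str_replace('<', ' <', $content));
        $content = html_entity_decode($content, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    // Split on anything that is not a letter or digit (so "don't" becomes "don", "t").
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($content), -1, PREG_SPLIT_NO_EMPTY);
    return $words ?: []; // preg_split() returns false on invalid UTF-8
}

/**
 * Build UID => document name => words.
 * $corpus maps UID => [document name => raw content].
 */
function buildForwardIndex(array $corpus): array
{
    $index = [];
    foreach ($corpus as $uid => $documents) {
        foreach ($documents as $name => $content) {
            $isHtml = (bool) preg_match('/\.html?$/i', $name);
            $index[$uid][$name] = tokenize($content, $isHtml);
        }
    }
    return $index;
}

// Hypothetical stand-in for crawled documents.
$corpus = [
    'UID1' => ['index.html' => '<html><body><h1>Helen</h1><p>She was champion, with freckles.</p><script>var a = 1;</script></body></html>'],
    'UID2' => ['gah.txt' => 'One, one and one is not Numberwang'],
];

foreach (buildForwardIndex($corpus) as $uid => $documents) {
    foreach ($documents as $name => $words) {
        echo $uid, ' -> ', $name, ' -> ', implode(',', $words), "\n";
    }
}
```

Swapping the regex-based stripping for a real HTML parser would be more robust on messy pages; the regex is used here only to keep the example short.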