How might one go about implementing a forward index in PHP?

I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.

Let us make a few basic assumptions:

  1. The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.

  2. The result of our awesome PHP-based forward indexing algorithm should be along the lines of:

    UID1 -> index.html -> helen,she,was,champion,with,freckles

    UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep

    UID2 -> blah.html -> next,week,on,badgerwatch

    UID2 -> gah.txt -> one,one,and,one,is,not,numberwang
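
One way to picture the target structure is a nested array keyed by UID, then by document name, holding the words in document order. This is only a sketch: the variable `$forwardIndex` and the helper `formatForwardIndex()` are hypothetical names, and the data is the sample above.

```php
<?php
// Hypothetical in-memory forward index: UID => document name => ordered word list.
$forwardIndex = [
    'UID1' => [
        'index.html' => ['helen', 'she', 'was', 'champion', 'with', 'freckles'],
        'foo.html'   => ['chicken', 'farmers', 'go', 'home', 'eat', 'sheep'],
    ],
    'UID2' => [
        'blah.html'  => ['next', 'week', 'on', 'badgerwatch'],
        'gah.txt'    => ['one', 'one', 'and', 'one', 'is', 'not', 'numberwang'],
    ],
];

// Render the index as "UID -> document -> comma-separated words", one line per document.
function formatForwardIndex(array $index): array
{
    $lines = [];
    foreach ($index as $uid => $documents) {
        foreach ($documents as $name => $words) {
            $lines[] = $uid . ' -> ' . $name . ' -> ' . implode(',', $words);
        }
    }
    return $lines;
}

echo implode(PHP_EOL, formatForwardIndex($forwardIndex)), PHP_EOL;
```

Keeping the words as an ordered list (duplicates included) preserves the order in which they appear in the document.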

Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization/word boundary disambiguation/part-of-speech-tagging. Of course, I do realise this is wishful thinking, and therefore will humble any worthy attempts at parsing said imaginary documents by:

  1. Extracting the real textual content stuff within the document as a list of words in the order in which they are presented.
  2. All the while, ignoring any garbage such as <script> and <html> tags to compute a list of UIDs (which could be, for instance, a domain) followed by document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
  3. Bear in mind a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.

At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.

Thanks in advance, hope this was clear enough.

karim79 asked Apr 27 '09 22:04



2 Answers

Take a look at

http://simplehtmldom.sourceforge.net/

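You can do something like this:

include 'simple_html_dom.php';                // the library ships as a single file
$p = file_get_html("http://www.page.com/");   // fetch and parse the page
echo $p->find("body", 0)->plaintext;          // index 0 returns the first match, not an array

That will give you all the text. To iterate over just the links:

foreach ($p->find("a") as $link)
{
    echo $link->innertext;   // property names are lowercase
}

It is very useful and powerful. Check it out.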
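
Once the plain text is available, turning it into the word list from the question could look roughly like the following. This is a sketch, not part of the library: `wordsFromText()` is a hypothetical helper, and treating letters, digits, and hyphens as word characters is an assumption.

```php
<?php
// Hypothetical helper: split already-extracted plain text into lowercase words,
// keeping them in the order they appear.
function wordsFromText(string $text): array
{
    // Any run of characters that is not a letter, digit, or hyphen is a separator.
    $words = preg_split('/[^\p{L}\p{N}-]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // preg_split fails on input that is not valid UTF-8
    }
    // strtolower only folds ASCII letters here; accented letters are left as-is.
    return array_map('strtolower', $words);
}

echo implode(',', wordsFromText('Next week, on BadgerWatch!')), PHP_EOL;
```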

Byron Whitlock answered Oct 10 '22 21:10


I don't think I'm totally clear on what you're trying to do, but you can get a simple result fairly easily:

  1. Run the page through Tidy (a good introduction) to make sure it's going to have valid HTML.
  2. Throw away everything before (and including) <body>.
  3. Step through the document one character at a time.
    1. If the character is a '<', don't do anything with the following characters until you see a '>' (skips HTML)
    2. If the character is a "word character" (alphanumeric, hyphen, possibly more) append it to the "current word".
    3. If the character is a "non-word character" (punctuation, space, possibly more), add the "current word" to the word list in the forward index, and clear the "current word".
  4. Do the above until you hit </body>.

That's really about it. You might have to add some exceptions for handling things like <script> tags (you don't want JavaScript to be treated as words that should be indexed), but that should give you a basic forward index.
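
The steps above can be sketched as a small state machine that reads a stream one character at a time, so the word list is built while the document is being read. This is a rough sketch under several assumptions: the function name `indexStream()` is hypothetical, the Tidy step is skipped, only ASCII word characters are recognised, a tag boundary also ends the current word, and entities such as `&amp;` are not decoded.

```php
<?php
// Hypothetical sketch: return the words found between <body> and </body>,
// skipping tags and the contents of <script> and <style> elements.
function indexStream($stream): array
{
    $words = [];
    $word = '';         // the "current word"
    $tag = null;        // null outside a tag, otherwise the tag text read so far
    $inBody = false;    // true once <body> has been seen
    $skipUntil = null;  // closing tag that ends a skipped element, e.g. "/script"

    while (($ch = fgetc($stream)) !== false) {
        if ($tag !== null) {
            if ($ch === '<') {
                $tag = '';              // stray '<' (e.g. inside a script): start over
            } elseif ($ch !== '>') {
                $tag .= $ch;            // still inside the tag
            } else {
                // Tag complete: extract its name, with a leading "/" for closing tags.
                preg_match('~^/?[a-z0-9]+~i', $tag, $m);
                $name = strtolower($m[0] ?? '');
                $tag = null;
                if ($skipUntil !== null) {
                    if ($name === $skipUntil) {
                        $skipUntil = null;
                    }
                } elseif ($name === 'body') {
                    $inBody = true;
                } elseif ($name === '/body') {
                    break;              // step 4: stop at </body>
                } elseif ($name === 'script' || $name === 'style') {
                    $skipUntil = '/' . $name;
                }
            }
            continue;
        }
        if ($ch === '<') {              // a tag starts: end the current word
            if ($word !== '') {
                $words[] = $word;
                $word = '';
            }
            $tag = '';
            continue;
        }
        if (!$inBody || $skipUntil !== null) {
            continue;                   // before <body>, or inside script/style
        }
        if (ctype_alnum($ch) || $ch === '-') {
            $word .= strtolower($ch);   // word character
        } elseif ($word !== '') {
            $words[] = $word;           // non-word character ends the word
            $word = '';
        }
    }
    if ($word !== '') {
        $words[] = $word;               // document ended without </body>
    }
    return $words;
}

// Usage with an in-memory stream; fopen() on a file or URL would work the same way.
$demo = fopen('php://memory', 'r+');
fwrite($demo, '<html><body><p>Next week, on <b>BadgerWatch</b>!</p></body></html>');
rewind($demo);
echo implode(',', indexStream($demo)), PHP_EOL;
```

Because the function only ever holds the current word and the current tag, memory use does not grow with document size beyond the word list itself.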

Chad Birch answered Oct 10 '22 20:10