How might one go about implementing a forward index in PHP?

Q: What is a forward index?

The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.

Q: What is forward indexing and backward indexing?

In the unrelated sense of sequence indexing (as in Python strings), forward indexing starts from 0, 1, 2, …, whereas backward indexing starts from −1, −2, −3, …, where −1 is the last element in a string, −2 is the second to last, and so on. Only integers can be used as indices; otherwise, a TypeError is raised.

Q: What's the difference between a forward index and an inverted index?

A forward index (or just index) is the list of documents and the words that appear in each of them. In the web search example, Google crawls the web, building the list of documents and figuring out which words appear in each page. The inverted index is the list of words and the documents in which they appear.
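
As a rough illustration of that relationship, the sketch below inverts a small forward index held in a PHP array. The function name `invertIndex` and the sample data are made up for this example, and a real indexer would likely also store word positions or frequencies.

```php
<?php
// Hypothetical helper: turn a forward index (document => words)
// into an inverted index (word => documents containing it).
function invertIndex(array $forward): array
{
    $inverted = [];
    foreach ($forward as $doc => $words) {
        // array_unique so that a document is listed once per word.
        foreach (array_unique($words) as $word) {
            $inverted[$word][] = $doc;
        }
    }
    // Collate by word, since the inverted index is sorted by word.
    ksort($inverted, SORT_STRING);
    return $inverted;
}

// Sample data, invented for the example.
$forward = [
    'foo.html'  => ['chicken', 'farmers', 'go', 'home'],
    'blah.html' => ['farmers', 'go', 'next', 'week'],
];

print_r(invertIndex($forward));
```

Keeping the forward index as `document => words` makes the inversion a single pass over the pairs followed by a sort on the word keys.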

Tags: php, indexing, parsing

I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.

Let us make a few basic assumptions:

The entire Interweb consists of about five thousand HTML and/or plain-text documents.
Each document resides within a particular domain (UID).
No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.

The result of our awesome PHP-based forward indexing algorithm should be along the lines of:

UID1 -> index.html -> helen,she,was,champion,with,freckles

UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep

UID2 -> blah.html -> next,week,on,badgerwatch

UID2 -> gah.txt -> one,one,and,one,is,not,numberwang

Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization, word boundary disambiguation and part-of-speech tagging. Of course, I do realise this is wishful thinking, and will therefore humbly accept any worthy attempt at parsing said imaginary documents by:

Extracting the real textual content stuff within the document as a list of words in the order in which they are presented.
All the while, ignoring any garbage such as <script> and <html> tags to compute a list of UIDs (which could be, for instance, a domain) followed by document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
Bear in mind that a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.

At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.

Thanks in advance, hope this was clear enough.


asked Apr 27 '09 22:04

karim79

2 Answers

Take a look at

http://simplehtmldom.sourceforge.net/

You do something like

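// Assumes simple_html_dom.php from the project above is on the include path.
require_once 'simple_html_dom.php';
$p = file_get_html('http://www.page.com/');   // fetch and parse the page
echo $p->find('body', 0)->plaintext;          // text of the first <body> element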

And that will give you all the text. Want to iterate over just the links?

foreach ($p->find("a") as $link)
{
    // The property name is lowercase in Simple HTML DOM.
    echo $link->innertext;
}

It is very useful and powerful. Check it out.
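
If pulling in a third-party parser is not an option, a similar result can be sketched with PHP's bundled DOM extension. This is a swapped-in technique, not what the library above does internally. The function name `extractWords` and the definition of a word (a run of letters or digits) are assumptions made for the example.

```php
<?php
// Hypothetical helper: return the lowercase words of an HTML string in
// document order, ignoring anything inside <script> or <style>.
function extractWords(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Select every text node under <body> that is not script or style content.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//body//text()[not(ancestor::script or ancestor::style)]'
    );

    $words = [];
    foreach ($nodes as $node) {
        // Assumed word pattern: one or more letters or digits.
        // strtolower only lowercases ASCII, which is enough for a sketch.
        preg_match_all('/[\p{L}\p{N}]+/u', strtolower($node->nodeValue), $m);
        $words = array_merge($words, $m[0]);
    }
    return $words;
}

$html = '<html><body><h1>Helen</h1><script>var x = 1;</script>'
      . '<p>She was champion, with freckles.</p></body></html>';

// Prints: UID1 -> index.html -> helen,she,was,champion,with,freckles
echo 'UID1 -> index.html -> ', implode(',', extractWords($html)), "\n";
```

Walking the text nodes one by one, rather than reading the body's whole text content, keeps words in adjacent elements from being glued together. The trade-off is that the whole document has to be parsed before any word is produced.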


answered Oct 10 '22 21:10

Byron Whitlock

I don't think I'm totally clear on what you're trying to do, but you can get a simple result fairly easily:

Run the page through Tidy (a good introduction) to make sure it's going to have valid HTML.
Throw away everything before (and including) <body>.
Step through the document one character at a time.
1. If the character is a '<', don't do anything with the following characters until you see a '>' (skips HTML)
2. If the character is a "word character" (alphanumeric, hyphen, possibly more) append it to the "current word".
3. If the character is a "non-word character" (punctuation, space, possibly more), add the "current word" to the word list in the forward index, and clear the "current word".
Do the above until you hit </body>.

That's really about it. You might have to add some exceptions for handling things like <script> tags (you don't want JavaScript to be treated as words that should be indexed), but that should give you a basic forward index.
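
The character-by-character steps above can be sketched roughly as follows. This is only one possible reading of them, not the answerer's own code. The function name `indexStream`, the chunk size, and the choice to end a word at '<' are assumptions. The Tidy pass and the trimming to <body> are left out, and only ASCII text is handled. Because it reads from a stream in chunks, the word list is built while the document is still being read.

```php
<?php
// Hypothetical helper: scan an HTML stream one character at a time and
// return its words in order. ASCII-only; entities such as &amp; are not decoded.
function indexStream($handle, int $chunkSize = 4096): array
{
    $words = [];
    $word  = '';      // the "current word"
    $tag   = '';      // characters of the tag currently being read
    $state = 'text';  // 'text', 'tag' or 'skip'
    $until = '';      // in 'skip' state: the closing tag that ends the skip
    $tail  = '';      // in 'skip' state: the most recent characters seen

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($state === 'skip') {
                // Inside <script>/<style>: wait for the literal closing tag.
                $tail = substr($tail . strtolower($c), -strlen($until));
                if ($tail === $until) {
                    $state = 'text';
                    $tail  = '';
                }
            } elseif ($state === 'tag') {
                if ($c !== '>') {
                    $tag .= $c;
                    continue;
                }
                // End of tag: the content of <script> and <style> is skipped.
                if (preg_match('/^(script|style)\b/i', $tag, $m) === 1) {
                    $state = 'skip';
                    $until = '</' . strtolower($m[1]) . '>';
                } else {
                    $state = 'text';
                }
                $tag = '';
            } elseif (ctype_alnum($c) || $c === '-') {
                $word .= strtolower($c);
            } else {
                // Non-word character (including '<'): close the current word.
                if ($word !== '') {
                    $words[] = $word;
                    $word = '';
                }
                if ($c === '<') {
                    $state = 'tag';
                }
            }
        }
    }
    if ($word !== '') {
        $words[] = $word; // last word when the input ends without a delimiter
    }
    return $words;
}

// Usage with an in-memory stream; a handle from fopen('foo.html', 'r') works the same way.
$h = fopen('php://memory', 'w+');
fwrite($h, '<body><p>Chicken farmers go home,</p><script>var a = 1 < 2;</script>eat sheep</body>');
rewind($h);
// Prints: UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep
echo 'UID1 -> foo.html -> ', implode(',', indexStream($h, 8)), "\n";
fclose($h);
```

The state is kept outside the chunk loop, so a word or tag that straddles two chunks is still handled correctly. A closing tag written with extra whitespace, such as `</script >`, would not end the skip in this sketch.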

answered Oct 10 '22 20:10

Chad Birch
