 

Scrape with wildcards and PHP

I'm having a hard time working out a way to scrape this page: http://www.morewords.com/ends-with/aw for the words themselves. Given a URL, I'd like to fetch its contents and build a PHP array of all the words listed, which appear in the source like this:

<a href="/word/word1/">word1</a><br />
<a href="/word/word2/">word2</a><br />
<a href="/word/word3/">word3</a><br />
<a href="/word/word4/">word4</a><br />

There are a few ways I've been thinking about doing this; I'd appreciate help deciding which is the most efficient, along with any advice or examples on how to achieve it. I understand it's not incredibly complicated, but I could use the help of you advanced hackers.

  • Use some sort of jQuery $.each() loop to collect them into a JS array, then transcribe that (probably heavily taxing)
  • Use some sort of cURL request (I don't have much experience with cURL)
  • Use a sophisticated find-and-replace with regex
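For the markup shown above, the regex route could be as simple as a single preg_match_all over the fetched HTML. A sketch, assuming the anchors keep exactly the /word/.../ shape from the question (a DOM parser is more robust if the markup varies):

```php
<?php
// Sample of the markup from the question (assumed shape)
$html = '<a href="/word/word1/">word1</a><br />'
      . '<a href="/word/word2/">word2</a><br />'
      . '<a href="/word/word3/">word3</a><br />';

// Capture the link text of every /word/.../ anchor
preg_match_all('~<a href="/word/\w+/">(\w+)</a>~', $html, $matches);
$words = $matches[1]; // array('word1', 'word2', 'word3')
```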
willium asked May 05 '11 23:05


1 Answer

You tagged it as PHP, so here is a PHP solution :)

$dom = new DOMDocument;

libxml_use_internal_errors(true); // silence warnings from imperfect real-world HTML

$dom->loadHTMLFile('http://www.morewords.com/ends-with/aw');

$words = array();

// Keep the text of every <a> whose href matches the /word/.../ pattern
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href') && preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
        $words[] = $anchor->nodeValue;
    }
}


If allow_url_fopen is disabled in php.ini, you could use cURL to get the HTML.

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.morewords.com/ends-with/aw');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
$html = curl_exec($curl);
curl_close($curl);
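The $html string from cURL can then be handed to the same DOM code, just using loadHTML() instead of loadHTMLFile(). A sketch, with the snippet from the question standing in for the fetched page:

```php
<?php
// Stand-in for the string curl_exec() would return
$html = '<a href="/word/word1/">word1</a><br />'
      . '<a href="/word/word2/">word2</a><br />'
      . '<a href="/other/">skip me</a>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // real pages are rarely perfectly valid HTML
$dom->loadHTML($html);

$words = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href') && preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
        $words[] = $anchor->nodeValue;
    }
}
// Only the /word/.../ links survive the filter
```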
alex answered Oct 13 '22 15:10