How to 'scrape' content from a page's source? [closed]

Tags:

scrape

I have this code which gets the HTML source of a page:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content from it. For example, say the page's source contains this:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way I could scrape this from the source and store it in a variable, so it'll look like this:

technorati.com Connection failed
icerocket.com Connection failed
eblogs.com Done
Ect.

Of cause the page is dynamic which is why I'm having a problem. Could I maybe search for each site in the source? But then how would I get the result which is after it? (Connection failed / Done)
Thanks a lot for the help!

462

asked Sep 06 '11 14:09

Joey Morani

1 Answers

I have tried scraping multiple sites using the simple HTML DOM PHP library, which can be obtained here: http://simplehtmldom.sourceforge.net/

Then using code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach($html->find('h2') as $heading) { //for each heading
        //find all spans with a inside then echo the found text out
        echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n"; 
}
?>

This results in something like:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

111

answered Sep 23 '22 09:09

Cheesebaron

Related questions
                            
                                Unable to access phpMyAdmin after a password is set to the database
                            
                                Regex allowing spaces (for a Phone number regex)
                            
                                unable to scrape content from a website
                            
                                file_get_contents() for short urls
                            
                                Omitting Closing Php Tag [duplicate]
                            
                                How to use GROUP_CONCAT with Zend Framework?
                            
                                PHP MySQL Update minus variable without separate SELECT Query
                            
                                .htaccess; no extension required
                            
                                Create objects on the fly without variable assignment with PHP
                            
                                PHP - Expose own size to client (so the client knows how much it is downloading)
                            
                                Is it possible to re-use a Kohana ORM query for the row count?
                            
                                doctrine 2 and zend paginator
                            
                                How to generate this kind of random curves?
                            
                                Javascript confirm delete
                            
                                Insert current date to the database?
                            
                                APC - how to handle GC cache warnings?
                            
                                Is calling array() without arguments of any use?
                            
                                can i use PhoneGap Jquery to make ajax calls?
                            
                                PHP XDebug disable breaking on each request
                            
                                PHP: Cleanest way to modify multidimensional array?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With