Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to 'scrape' content from a page's source? [closed]

Tags:

php

scrape

I have this code which gets the HTML source of a page:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content from it. For example, say the page's source contains this:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way I could scrape this from the source and store it in a variable, so it'll look like this:

technorati.com Connection failed
icerocket.com Connection failed
eblogs.com Done
Ect.

Of cause the page is dynamic which is why I'm having a problem. Could I maybe search for each site in the source? But then how would I get the result which is after it? (Connection failed / Done)
Thanks a lot for the help!

like image 462
Joey Morani Avatar asked Sep 06 '11 14:09

Joey Morani


People also ask

How do I scrape hidden data from a website?

You can use the Attribute selector to scrape these hidden tags from HTML. You can write your selector manually and then enter the “content” in the attribute name option to scrape efficiently.

How do I grab content from a website?

Open the three-dot menu on the top right and select More Tools > Save page as. You can also right-click anywhere on the page and select Save as or use the keyboard shortcut Ctrl + S in Windows or Command + S in macOS. Chrome can save the complete web page, including text and media assets, or just the HTML text.

Can you scrape websites legally?

Web scraping is legal if you scrape data publicly available on the internet. But some kinds of data are protected by international regulations, so be careful scraping personal data, intellectual property, or confidential data.


1 Answers

I have tried scraping multiple sites using the simple HTML DOM PHP library, which can be obtained here: http://simplehtmldom.sourceforge.net/

Then using code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach($html->find('h2') as $heading) { //for each heading
        //find all spans with a inside then echo the found text out
        echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n"; 
}
?>

This results in something like:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
like image 111
Cheesebaron Avatar answered Sep 23 '22 09:09

Cheesebaron