Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scrape web page contents

I am developing a project, for which I want to scrape the contents of a website in the background and get some limited content from that scraped website. For example, in my page I have "userid" and "password" fields, by using those I will access my mail and scrape my inbox contents and display it in my page.

I done the above by using javascript alone. But when I click the sign in button the URL of my page (http://localhost/web/Login.html) is changed to the URL (http://mail.in.com/mails/inbox.php?nomail=....) which I am scraped. But I scrap the details without changing my url.

like image 308
Sakthivel Avatar asked Feb 25 '09 05:02

Sakthivel


People also ask

Can you legally scrape a website?

Web scraping is legal if you scrape data publicly available on the internet. But some kinds of data are protected by international regulations, so be careful scraping personal data, intellectual property, or confidential data. Respect your target websites and use empathy to create ethical scrapers.

Can you scrape dynamic content from a website?

There are two approaches to scraping a dynamic webpage: Scrape the content directly from the JavaScript. Scrape the website as we view it in our browser — using Python packages capable of executing the JavaScript.

Is Web scraping easy?

The answer to that question is a resounding YES! Web scraping is easy! Anyone even without any knowledge of coding can scrape data if they are given the right tool. Programming doesn't have to be the reason you are not scraping the data you need.


1 Answers

Definitely go with PHP Simple HTML DOM Parser. It's fast, easy and super flexible. It basically sticks an entire HTML page in an object then you can access any element from that object.

Like the example of the official site, to get all links on the main Google page:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images 
foreach($html->find('img') as $element) 
       echo $element->src . '<br>';

// Find all links 
foreach($html->find('a') as $element) 
       echo $element->href . '<br>';
like image 162
givp Avatar answered Oct 14 '22 03:10

givp