
How to use cURL to fetch specific data from a website and then save it to my database using PHP

Tags: php, mysql

Can anyone tell me how to use cURL or file_get_contents to download specific data from a website and then save that data into my MySQL database? I want to get the latest film additions from http://www.traileraddict.com/ and save them in my database (on a daily basis; this text and HTML link will be shown on my website). I just need the text and the HTML link (highlighted in the picture).

[screenshot: TrailerAddict listing with the film title and trailer link highlighted]

I have searched everywhere but didn't find any useful tutorial. I have two main questions:

1) How can I get specific data using cURL or file_get_contents?

2) How can I save that content to my MySQL database table (text in one column and link in another)?

Eka asked Mar 01 '12

2 Answers

Using cURL:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.something.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$content = curl_exec($ch);
curl_close($ch);

Then you can load the content into a DOMDocument object and parse the DOM for the specific data. You could also try to parse the data with string searches, but using regex on HTML is highly frowned upon.

// suppress warnings from malformed real-world HTML
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($content);

// Parse the DOM for your desired content
  • http://www.php.net/manual/en/class.domdocument.php
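
For example, here is a minimal sketch of the parsing step using DOMXPath. The div.info / h2 > a structure in the query is an assumption about the TrailerAddict listing markup and may need adjusting:

// Hedged sketch: assumes each film sits in <div class="info"> with the
// title and link inside an <h2><a href="/trailer/..."> element.
$ch = curl_init('http://www.traileraddict.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);

$xpath = new DOMXPath($dom);
// the XPath query is an assumption about the page structure
foreach ($xpath->query('//div[@class="info"]//h2/a') as $a) {
    $title = trim($a->textContent);
    $url   = 'http://www.traileraddict.com' . $a->getAttribute('href');
    echo $title . ' => ' . $url . PHP_EOL;
}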
SS44 answered Oct 26 '22


This should work, but it's messy, and it may break if the site you are scraping changes its markup, which will affect the scraping:

$sites[0] = 'http://www.traileraddict.com/';

// use this if you want to retrieve more than one page:
// $sites[1] = 'http://www.traileraddict.com/trailers/2';

// connect to the database once, before the loops (fill in your own details)
$db_conn = mysql_connect($host, $user, $password) or die('error');
mysql_select_db($database, $db_conn) or die(mysql_error());

foreach ($sites as $site)
{
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);

    // ok, you have the whole page in the $html variable
    // now you need to find the common div that contains all the review info
    // and that appears to be <div class="info"> (I think you could use abstract as well)
    $title_start = '<div class="info">';

    $parts = explode($title_start, $html);

    // the first element is everything before the first info div, so drop it
    array_shift($parts);

    // now you have an array of the info divs on the page

    foreach ($parts as $part) {

        // so now you just need to get your title and link from each part

        $link = explode('<a href="/trailer/', $part);

        // this means you now have part of the trailer url,
        // you just need to cut off the end which you don't need:

        $link = explode('">', $link[1]);

        // this should give something of the form:
        // overnight-2012/trailer
        // so just make an absolute url out of it:

        $url = 'http://www.traileraddict.com/trailer/' . $link[0];

        // now for the title we need to follow a similar process:

        $title = explode('<h2>', $part);

        $title = explode('</h2>', $title[1]);

        $title = strip_tags($title[0]);

        // INSERT DB CODE HERE, escaping the values before building the query

        $sql = "INSERT INTO trailers (url, title) VALUES ('"
             . mysql_real_escape_string($url) . "', '"
             . mysql_real_escape_string($title) . "')";

        mysql_query($sql) or die(mysql_error());
    }
}

That should be it, now you have a variable for the link and title that you can insert into your database.
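
If you would rather avoid the old mysql_* functions, here is a minimal sketch of the same insert using PDO with a prepared statement; the DSN, credentials and the trailers(url, title) table are assumptions you would adjust:

// Hedged sketch: PDO prepared statement instead of mysql_query()
// (connection details and table name are placeholders)
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO trailers (url, title) VALUES (:url, :title)');
$stmt->execute(array(':url' => $url, ':title' => $title));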

DISCLAIMER

I have written this off the top of my head at work, so I apologise if it doesn't work straight off the bat, but let me know if it doesn't and I will try to help further.

ALSO, I am aware this could be done smarter and in fewer steps, but that would involve more thinking on my part, and the OP can refine it once they understand the code. I would assume it is far more important that they understand what I have done and can edit it themselves.

Also, I would advise scraping the site at night so as not to burden it with extra traffic, and I would suggest asking for the site's permission as well, since if they catch you they can put an end to your scraping :(

To answer your final point - to run this at a set time period you would use a cron job.
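
For example, a crontab entry along these lines would run the scraper every night at 02:00; the PHP binary path and the scrape.php script name are assumptions:

# Hedged sketch: run the (hypothetical) scrape.php script nightly at 02:00
0 2 * * * /usr/bin/php /path/to/scrape.php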

martincarlin87 answered Oct 26 '22