
PHP web scraping with Simple HTML DOM not working when the HTML tags are out of order

I want to scrape some information from a webpage. It uses a table-based layout.

I want to extract the third table in the nested table layout. It contains a series of nested tables, each publishing a result. But the code is not working:

include('simple_html_dom.php');
$url = 'http://exams.keralauniversity.ac.in/Login/index.php?reslt=1';
$html = file_get_contents($url);
$result =$html->find("table", 2);
echo $result;

I used cURL to fetch the page, but the problem is that its tags are out of order, so the content cannot be extracted with Simple HTML DOM.

    function curl($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }

    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start); // Stripping all data from before $start
        $data = substr($data, strlen($start));  // Stripping $start
        $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
        $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
        return $data;   // Returning the scraped data from the function
    }

    $scraped_page = curl($url);  // Executing our curl function to fetch the page at $url and return the result into the $scraped_page variable

    $scraped_data = scrape_between($scraped_page, ' </html>', '</table></td><td></td></tr>
   </table>');  
    echo $scraped_data;
    $myfile = fopen("newfile.html", "w") or die("Unable to open file!");
    fwrite($myfile, $scraped_data);
    fclose($myfile);

How can I scrape the results and save the PDFs?

codefreaK asked Nov 02 '15 09:11




2 Answers

Simple HTML DOM can't handle that HTML. First switch to this library (Advanced HTML DOM, loaded below as advanced_html_dom.php), then do:

require_once('advanced_html_dom.php');

$dom = file_get_html('http://exams.keralauniversity.ac.in/Login/index.php?reslt=1');

$rows = array();
// Each result is a row of class Function_Text_Normal that has a third cell
foreach($dom->find('tr.Function_Text_Normal:has(td[3])') as $tr){
  $row = array(); // start fresh so values from the previous row don't carry over
  $row['num'] = $tr->find('td[2]', 0)->text;
  $row['text'] = $tr->find('td[3]', 0)->text;
  $row['pdf'] = $tr->find('td[3] a', 0)->href;
  // the date is taken from the digits in the first <u> element of the row's parent
  if(preg_match_all('/\d+/', $tr->parent->find('u', 0)->text, $m)){
    list($row['day'], $row['month'], $row['year']) = $m[0];
  }

  // uncomment next 2 lines to save the pdf
  // $filename = preg_replace('/.*\//', '', $row['pdf']);
  // file_put_contents($filename, file_get_contents($row['pdf']));
  $rows[] = $row;
}
var_dump($rows);
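
The two string steps in the loop above (pulling the day, month and year out of a heading, and deriving a local file name from the PDF link) can be tried on their own without touching the network. This is only a minimal sketch: the heading text and the PDF URL are hypothetical sample values, not taken from the real page, and the extra count check is an added safeguard that the loop above does not have.

```php
<?php
// Hypothetical sample values; in the scraper they come from each table row.
$heading = 'Published on 02/11/2015';
$pdfUrl  = 'http://www.example.com/results/sample_result.pdf';

// Collect every run of digits; the first three are read as day, month, year.
$date = array();
if (preg_match_all('/\d+/', $heading, $m) && count($m[0]) >= 3) {
    list($date['day'], $date['month'], $date['year']) = $m[0];
}

// Drop everything up to and including the last slash to get the file name.
$filename = preg_replace('/.*\//', '', $pdfUrl);

echo $date['day'], '-', $date['month'], '-', $date['year'], ' ', $filename, "\n";
```

The greedy `.*` in the second pattern is what makes it stop at the last slash rather than the first, so only the final path segment is kept.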
pguardiario answered Oct 31 '22 10:10



Here is some sample code:


    <?php
        // Defining the basic cURL function
        function curl($url) {
            $ch = curl_init();  // Initialising cURL
            curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
            $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
            curl_close($ch);    // Closing cURL
            return $data;   // Returning the data from the function
        }
    ?>

    <?php
        $scraped_website = curl("http://www.example.com");  // Executing our curl function to scrape the webpage http://www.example.com and return the results into the $scraped_website variable
        $result = substr($scraped_website, 11, 7); // change the offset (11) and length (7) so the slice covers the table
        echo $result;
    ?>
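
Instead of cutting the fetched string by character offsets, the markup could also be handed to PHP's built-in DOMDocument, which tolerates many kinds of misnested tags. This swaps in a different technique from the code above, and it is only a sketch: the sample HTML is a hypothetical stand-in for the real page, and whether the real page parses cleanly this way is an assumption that has not been checked.

```php
<?php
// Hypothetical sample markup with a misnested <b>, standing in for the fetched page.
$html = '<html><body>
<table><tr><td>
  <table><tr><td>menu</td></tr></table>
  <table>
    <tr><td><b>Result A</td></b></tr>
    <tr><td>Result B</td></tr>
  </table>
</td></tr></table>
</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$tables = $xpath->query('//table'); // all tables, in document order
$third  = $tables->item(2);         // zero-based index: the third table

// Gather the text of every cell inside the third table.
$cells = array();
foreach ($xpath->query('.//td', $third) as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

In a real run, `$html` would be the string returned by the `curl()` function defined above.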
Ananta Prasad answered Oct 31 '22 10:10
