
Extracting site data with a web crawler fails because of a mismatched array index

I have been trying to extract the table text, along with its links, from the table below (which is on site1.com) into my PHP page using a web crawler.

Unfortunately, because of an incorrect array index in the PHP code, the output is an error.

site1.com

<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="65%" valign="top" class="Title2">Subject</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>

The PHP web crawler is as follows:

<?php
    function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $result=curl_exec($ch);
    curl_close($ch);
    return $result;
    }
    $returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
    $first_step = explode( '<table class="Table2">' , $returned_content );
    $second_step = explode('</table>', $first_step[0]);
    $third_step = explode('<tr>', $second_step[1]);
    // print_r($third_step);
    foreach ($third_step as $key=>$element) {
    $child_first = explode( '<td class="FootNotes2"' , $element );
    $child_second = explode( '</td>' , $child_first[1] );
    $child_third = explode( '<a href=' , $child_second[0] );
    $child_fourth = explode( '</a>' , $child_third[0] );
    $final = "<a href=".$child_fourth[0]."</a></br>";
?>

<li target="_blank" class="itemtitle">
    <?php echo $final?>
</li>

<?php
    if($key==10){
       break;
        }
    }
?>

I suspect the array indexes in the PHP code above are the culprit. If so, can someone please explain how to make this work?

My final requirement is to get the text from the second column of the table above, together with the link associated with it.

Any help is appreciated.

harishk asked Feb 09 '17 13:02



1 Answer

Using the Simple HTML DOM Parser library, you can use the following code:

<?php
    require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.

    $html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');

    foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
        $element->href = "http://www.usmleforum.com" . $element->href;  // you can also access only certain attributes of the elements (e.g. the url).
        echo $element . '<br>';  // do something with the elements ("</br>" is not valid HTML, so use "<br>").
    }
?>
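If adding a third-party library is not an option, roughly the same selection can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is only a minimal sketch, not part of the original answer: the helper name extract_links() is hypothetical, the markup is a trimmed copy of the table from the question, and the base URL is the one used in the code above.

```php
<?php
// Sample markup copied from the question, trimmed to the relevant cells.
$html = <<<'HTML'
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="65%" valign="top" class="Title2">Subject</td>
</tr>
<tr>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
</tr>
</tbody>
</table>
HTML;

// Hypothetical helper: returns the URL and text of every link found
// inside a <td class="FootNotes2"> cell.
function extract_links(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    // Matches <a href> elements under a <td> whose class is exactly "FootNotes2".
    foreach ($xpath->query('//td[@class="FootNotes2"]//a[@href]') as $a) {
        $links[] = [
            'url'  => $baseUrl . $a->getAttribute('href'), // hrefs on the page are site-relative
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

$links = extract_links($html, 'http://www.usmleforum.com');
foreach ($links as $link) {
    echo '<li class="itemtitle"><a href="' . htmlspecialchars($link['url']) . '">'
        . htmlspecialchars($link['text']) . "</a></li>\n";
}
```

For the live page, the string passed to extract_links() would be whatever get_data() from the question returns. The XPath expression matches on attributes rather than on literal tag text, so it does not depend on the order of the attributes inside each tag.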
MrDarkLynx answered Sep 25 '22 08:09
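As for the array-index error in the question, a likely cause is that the explode() delimiters never occur in the page. The real tags carry other attributes first (for example border="0" ... before class="Table2"), and when explode() does not find its delimiter it returns a one-element array holding the whole string, so index 1 does not exist. The snippet below is a small sketch of that behaviour using the opening tag from the question's sample markup. The shortened table body and the alternative delimiter are assumptions made for illustration.

```php
<?php
// Opening <table> tag exactly as it appears in the question's sample markup;
// the body is shortened to a single placeholder row.
$page = '<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">'
      . '<tr><td>row</td></tr></table>';

// The delimiter from the question does not occur in $page, because the real
// tag has other attributes before class="Table2".
$first_step = explode('<table class="Table2">', $page);
var_dump(count($first_step));     // a single element: the whole, unsplit page
var_dump(isset($first_step[1]));  // index 1 is missing

// Hypothetical alternative: split on a substring that really is in the markup,
// then take the part AFTER the opening tag and BEFORE the closing tag.
$parts = explode('class="Table2">', $page);
$table_body = explode('</table>', $parts[1])[0];
echo $table_body, "\n";
```

The same mismatch affects the '<td class="FootNotes2"' delimiter, since those cells start with width and height attributes. String splitting of this kind breaks whenever the markup changes, which is why a DOM parser is the more robust choice.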