str_get_html is not loading a valid html string

I receive an html string using curl:

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html_string = curl_exec($ch);

When I echo it, I see perfectly good HTML, which is exactly what I need for parsing. But when I pass this string to the HTML DOM Parser function str_get_html($html_string), it does not load it (the call returns false).

I tried saving it to a file and opening that file with file_get_html, but the same thing happens.

What could be causing this? As I said, the HTML looks perfectly fine when I echo it.

Thanks a lot.

The code itself:

$html = file_get_html("http://www.bgu.co.il/tremp.aspx");
$v = $html->find('input[id=__VIEWSTATE]');
$viewState = $v[0]->attr['value'];
$e = $html->find('input[id=__EVENTVALIDATION]');
$event = $e[0]->attr['value'];

$html->clear(); 
unset($html);

$body = " A_STRING_THAT_CONTAINS_SOME_DATA ";

$ch = curl_init("http://www.bgu.co.il/tremp.aspx");
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$html_string = curl_exec($ch);

$file_handle = fopen("file.txt", "w");
fwrite($file_handle, $html_string);
fclose($file_handle);

curl_close($ch);

$html = str_get_html($html_string);
asked Jan 05 '13 at 14:01 by Dani

2 Answers

Your curl link seems to return a page with many elements (a large file).

I was parsing a string (file) as large as yours and ran into the same problem.

After reading the source code, I found the cause: simple_html_dom.php limits the size of the string it will read. Raising that limit worked for me.

// get html dom from string
  function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
  {
           $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
           if (empty($str) || strlen($str) > MAX_FILE_SIZE)
           {
                   $dom->clear();
                   return false;
           }
           $dom->load($str, $lowercase, $stripRN);
           return $dom;
  }

You need to raise the default limit shown below (it is defined near the top of simple_html_dom.php).
Maybe change it to 100000000? It's up to you.

define('MAX_FILE_SIZE', 6000000); 
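
Before editing the library, it may help to confirm that the size guard is really what rejects the string. The following is a minimal sketch that mirrors the `empty($str) || strlen($str) > MAX_FILE_SIZE` check quoted above. The helper name `why_str_get_html_fails` is hypothetical (it is not part of simple_html_dom), and the limit is passed in as a parameter so the sketch runs without the library.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): reports which part of
// the guard quoted above would make str_get_html() return false.
function why_str_get_html_fails($str, int $maxFileSize): ?string
{
    if ($str === false) {
        // curl_exec() returns false on failure, and empty(false) is true.
        return 'curl_exec() failed; check curl_error($ch)';
    }
    if (empty($str)) {
        // empty() is true for '' and also for the string '0'.
        return 'input is empty';
    }
    if (strlen($str) > $maxFileSize) {
        return sprintf('input is %d bytes, limit is %d', strlen($str), $maxFileSize);
    }
    return null; // the guard would let this string through
}

// Stand-in data; in the question this would be $html_string from curl_exec().
$limit = 6000000;                          // the value shown in the define() above
$page  = str_repeat('<p>x</p>', 1000000);  // 8 bytes repeated 1,000,000 times

echo why_str_get_html_fails($page, $limit) ?? 'ok', PHP_EOL;
```

If the helper reports a size problem, raising `MAX_FILE_SIZE` as described above is the relevant fix. If it reports an empty or false input, the problem is in the curl request instead.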
answered Sep 24 '22 at 16:09 by twxia


Did you check if the HTML is somehow encoded in a way HTML DOM PARSER doesn't expect? E.g. with HTML entities like &lt;html&gt; instead of <html> – that would still be displayed as correct HTML in your browser but wouldn't parse.
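
One rough way to test for this is to look for escaped tags in a string that has no literal tags, and decode the string if so. The sketch below assumes the response is UTF-8. The helper name `looks_entity_encoded` and the sample string are made up for illustration, while `html_entity_decode` is PHP's built-in function.

```php
<?php
// Hypothetical helper: true when the string contains escaped tags such as
// "&lt;html&gt;" but no literal "<tag" anywhere.
function looks_entity_encoded(string $html): bool
{
    $hasEscapedTag = preg_match('/&lt;\s*\/?[a-z!]/i', $html) === 1;
    $hasLiteralTag = preg_match('/<\s*\/?[a-z!]/i', $html) === 1;
    return $hasEscapedTag && !$hasLiteralTag;
}

// Made-up sample standing in for the curl response.
$raw = '&lt;html&gt;&lt;body&gt;&lt;input id=&quot;__VIEWSTATE&quot;&gt;&lt;/body&gt;&lt;/html&gt;';

if (looks_entity_encoded($raw)) {
    // html_entity_decode() turns &lt; &gt; &quot; back into < > "
    $raw = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo $raw, PHP_EOL;
```

The heuristic is deliberately conservative: a normal page that merely contains some escaped markup inside real tags is left untouched.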

answered Sep 21 '22 at 16:09 by florian h