
PHP HTML Tidy: size limit to buffer

Tags: php, htmltidy

I'm trying to use the HTML Tidy implementation that's part of PHP (http://www.php.net/manual/en/book.tidy.php) in order to reformat a large chunk of HTML. I'm having a problem wherein Tidy is truncating the output past a certain point (about 8K).

When I create a string that's about 10K long & hand it to tidy_repair_string, like so:

$output = tidy_repair_string($output, array( 
    'indent' => true, // enforce indentation 
    'hide-comments' => true, // Remove the comments 
    'wrap' => 100, // Break each line after 100 chars 
    'output-html' => true, // Output as HTML 
    'char-encoding' => $encoding // The input/output encoding 
), $encoding); 

It lops off everything after 8,070 characters. If I pad the beginning of the string with, say, 10 characters, then exactly 10 more characters are left off the end.
Is there a way to change the buffer size for tidy_repair_string, so that it's bigger?

Looking at http://www.php.net/manual/en/tidy.getconfig.php there doesn't appear to be a config option for it, Google is remarkably unhelpful/my Google-fu has failed me, and there's not a ton of documentation around this. Any help would be greatly appreciated!

EDIT: I'm using xampp-portable-lite-win32-1.8.1-VC9 on Windows 7. The problem continues to happen even when I change php.ini to use memory_limit = 900M

MikeTheTall asked Apr 04 '13 16:04


1 Answer

All right, I can think of a couple of reasons why this could be failing.

  1. You've exceeded your memory limit, not just with this function call but also by loading the variable into memory and any pre-processing you're doing. To test this, you could try raising memory_limit in php.ini to something ungodly high, or you could use memory_get_usage(): run it once before creating your object, then again afterwards, and take the difference between the two results. (See "How to find memory used by an object in PHP? (sizeof)".)

  2. PHP's tidy extension is built on the same Tidy library as the command-line tidy program. I know that a while back, the program had a limit of 4096 characters put into it at once (http://www.autoitscript.com/forum/topic/129973-tidy-4096-char-limit/), but it looks as though that error has been fixed. What I'd recommend, just to test that theory, is to echo out your 10K string (it'll take a minute) and then run that straight through the command-line tidy program. I decided to test this theory myself:

    From bash, run echo $(python -c 'print(20000*"a")') > test_file. Since a char is 1 byte, this creates a file of about 20K. Obviously, this won't validate with tidy, but it's some nice junk text to throw at the program. Now feed it into tidy with tidy < test_file (if you don't have tidy on the command line, sudo apt-get install tidy). For me, this doesn't fail, but give it a try yourself. If it doesn't fail, then the problem isn't specific to the command-line tidy program.

    Now we've eliminated php.ini and the command-line tidy program as the problems.

  3. I then tried to recreate your error.

    I started out using the comment from above, parsing a file rather than a string.

    <?PHP
    $output = tidy_repair_file("test_file");
    
    print strlen($output);
    ?>
    

    For the tidy_repair_file strlen, I got 20111 (where the additional 111 characters come from tidy formatting). No truncation. Then I tried to read the file into memory and parse it as a string.

    <?PHP
    $data = readfile("test_file"); //meant to read the 20K file into memory, but readfile() prints it and returns the byte count
    
    $encoding = "ascii"; //I just set my encoding to 'ascii' because I like it...
    
    $output = tidy_repair_string($data, array(
    'indent' => true, // enforce indentation
    'hide-comments' => true, // Remove the comments
    'wrap' => 100, // Break each line after 100 chars
    'output-html' => true, // Output as HTML
    'char-encoding' => $encoding // The input/output encoding
    ), $encoding);
    
    print strlen($output);
    ?>
    

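I obviously am doing something wrong here (most likely readfile(), which prints the file and returns its byte count rather than its contents), because I get my junk file echoed back to me, then '132', which is the length of this basic HTML file: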

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
  <head>
    <title></title>
  </head>
  <body>
    20001
  </body>
</html>

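Since the body contains only 20001, tidy was evidently handed the byte count rather than the file contents, so this second test says little about truncation. The tidy_repair_file() run above, however, did handle a 20K file without truncation.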
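
The string test above assigns the return value of readfile() to $data. readfile() prints the file and returns the number of bytes read, which would explain the 20001 in the output. Here is a minimal sketch of the same test using file_get_contents() instead. It recreates test_file itself so it can run on its own, and it skips the tidy call if the tidy extension is not loaded. I have not run this on the asker's Windows setup, so treat it as a starting point rather than a verified result.

```php
<?php
// Recreate the junk input so the sketch is self-contained: 20,000 "a"
// characters plus a trailing newline, as the shell command above produces.
file_put_contents("test_file", str_repeat("a", 20000) . "\n");

// file_get_contents() returns the file's contents as a string;
// readfile() would print the file and return only its byte count.
$data = file_get_contents("test_file");

$encoding = "ascii";

if (function_exists('tidy_repair_string')) {
    $output = tidy_repair_string($data, array(
        'indent' => true,             // enforce indentation
        'hide-comments' => true,      // remove the comments
        'wrap' => 100,                // break each line after 100 chars
        'output-html' => true,        // output as HTML
        'char-encoding' => $encoding  // the input/output encoding
    ), $encoding);

    // Compare input and output lengths to spot truncation.
    print strlen($data) . " bytes in, " . strlen($output) . " bytes out\n";
} else {
    print "tidy extension is not loaded\n";
}
```

If the output length comes out well below the input length, the truncation is reproducible with a plain string on your setup.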

It's also worthwhile to note that I tried this code both using php test.php from the prompt and running it through a web browser. I get the same results. No truncation. It's also noteworthy for me to disclose that I'm running this out of Ubuntu Server, not Windows IIS.

Try outputting your variable to a file and then run tidy_repair_file() against it. Obviously, this solution is not sustainable and won't scale, but it will inform you of whether or not it's a problem with the original string.
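
A rough sketch of that file round trip follows. The placeholder value of $output and the temp-file name are assumptions for illustration; substitute the real HTML string from your page. The tidy call is skipped if the extension is not loaded.

```php
<?php
// Placeholder input (assumption): stands in for the large HTML string
// from the question so that the sketch runs on its own.
$output = "<html><body>" . str_repeat("<p>some text</p>", 1000) . "</body></html>";

// Write the string to a scratch file (hypothetical name chosen by tempnam).
$tmpFile = tempnam(sys_get_temp_dir(), "tidy");
$written = file_put_contents($tmpFile, $output);

if (function_exists('tidy_repair_file')) {
    // Let tidy read the file directly instead of receiving a string.
    $repaired = tidy_repair_file($tmpFile, array(
        'output-html' => true,
        'wrap' => 100
    ));
    print "wrote $written bytes, tidy returned " . strlen($repaired) . " bytes\n";
} else {
    print "tidy extension is not loaded\n";
}

// Clean up the scratch file.
unlink($tmpFile);
```

If the file version comes back complete while the string version is cut short, the problem lies in how the string reaches tidy rather than in the HTML itself.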

Also, try running strlen() on $output before and after your tidy call - make sure that your string is a 10K string before it ever hits tidy...just as a sanity check.
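
The sanity check can be combined with the memory_get_usage() idea from point 1, roughly like this. The placeholder input and the 'utf8' encoding are assumptions; use your own string and encoding. The tidy call is skipped if the extension is not loaded.

```php
<?php
// Placeholder input (assumption): about 10K of repeated markup.
$output = str_repeat("<p>0123456789</p>", 600);

// Record the length and memory usage before tidy touches the string.
$lengthBefore = strlen($output);
$memoryBefore = memory_get_usage();

if (function_exists('tidy_repair_string')) {
    $output = tidy_repair_string($output, array('output-html' => true), 'utf8');
}

// Record the same figures afterwards and report both.
$lengthAfter = strlen($output);
$memoryAfter = memory_get_usage();

print "length before: $lengthBefore, after: $lengthAfter\n";
print "memory delta: " . ($memoryAfter - $memoryBefore) . " bytes\n";
```

A "before" length that is already short of 10K would point at the code building the string, not at tidy.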

Good luck, and I hope some of this helps!

TopherGopher answered Sep 21 '22 02:09