Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sanitizing HTML input

I'm thinking of adding a rich text editor to allow a non-programmer to change the aspect of text. However, one issue is that it's possible to distort the layout of a rendered page if the markup is incorrect. What's a good lightweight way to sanitize html?

like image 827
James P. Avatar asked Apr 01 '11 11:04

James P.


People also ask

What does sanitizing input mean?

Sanitization may include the elimination of unwanted characters from the input by means of removing, replacing, encoding, or escaping the characters. Sanitization may occur following input (input sanitization) or before the data is passed across a trust boundary (output sanitization).

How do you disinfect a URL input?

A sanitized URL can be achieved by simply converting the input to lowercase and replacing the spaces with '%20'.

Should you sanitize user input?

The Basics. The first lesson anyone learns when setting up a web-to-database—or anything-to-database gateway where untrusted user input is concerned—is to always, always sanitize every input.


3 Answers

You will have to decide between good and lightweight. The recommended choice is 'HTMLPurifier', because it provide no-fuss secure defaults. As faster alternative it is often advised to use 'htmLawed'.

See also this quite objective overview from the HTMLPurifier author: http://htmlpurifier.org/comparison

like image 198
mario Avatar answered Sep 26 '22 05:09

mario


I really like HTML Purifier, which allows you to specify which tags and attirbutes are allowed in your HTML code -- and generates valid HTML.

like image 31
Pascal MARTIN Avatar answered Sep 23 '22 05:09

Pascal MARTIN


Use BB codes (or like here on SO), otherwise chances are very slim. Example function...

function parse($string){

    $pattern = array(
    "/\[url\](.*?)\[\/url\]/",
    "/\[img\](.*?)\[\/img\]/",
    "/\[img\=(.*?)\](.*?)\[\/img\]/",
    "/\[url\=(.*?)\](.*?)\[\/url\]/",
    "/\[red\](.*?)\[\/red\]/",
    "/\[b\](.*?)\[\/b\]/",
    "/\[h(.*?)\](.*?)\[\/h(.*?)\]/",
    "/\[p\](.*?)\[\/p\]/",    
    "/\[php\](.*?)\[\/php\]/is"
    );

    $replacement = array(
    '<a href="\\1">\\1</a>',
    '<img alt="" src="\\1"/>',
    '<img alt="" class="\\1" src="\\2"/>',
    '<a rel="nofollow" target="_blank" href="\\1">\\2</a>',
    '<span style="color:#ff0000;">\\1</span>',
    '<span style="font-weight:bold;">\\1</span>',
    '<h\\1>\\2</h\\3>',
    '<p>\\1</p>',
    '<pre><code class="php">\\1</code></pre>'
    );

    $string = preg_replace($pattern, $replacement, $string);

    $string = nl2br($string);

    return $string;

}

...

echo parse("[h2]Lorem Ipsum[/h2][p]Dolor sit amet[/p]");

Result...

<h2>Lorem Ipsum</h2><p>Dolor sit amet</p>

enter image description here

Or just use HTML Purifier :)

like image 31
Dejan Marjanović Avatar answered Sep 23 '22 05:09

Dejan Marjanović