Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strip all HTML attributes except for src

I'm trying to remove all tag attributes except for the src attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

Would be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');
like image 740
Ian McIntyre Silber Avatar asked Jun 08 '10 02:06

Ian McIntyre Silber


1 Answers

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

Add some quoting, and use the replacement text <$1$2$3> it should strip any non src= properties from well-formed HTML tags.

Please Note This isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp people are so cleverly noting below. There are a few fallbacks, most notably <p style=">"> would end up <p>"> and a few other broken issues... I would recommend looking at Zend_Filter_StripTags as a full proof tags/attributes filter in PHP

like image 144
gnarf Avatar answered Oct 21 '22 06:10

gnarf