PHP

Question

I am trying to sanitize a string and have ended up with the following:

Characterisation of the arsenic resistance genes in lt i gt Bacillus lt i gt sp UWC isolated from maturing fly ash acid mine drainage neutralised solids

I am trying to remove the lt, i, gt as those are reduced HTML entities which do not seem to be removed. What would be the best way to approach this or another solution that I could look at?

Here is my current solution for now:

/**
 * @return string
 */
public function getFormattedTitle()
{
    $string = preg_replace('/[^A-Za-z0-9\-]/', ' ',  filter_var($this->getTitle(), FILTER_SANITIZE_STRING));
    return $string;
}

And here is an example input string:

Assessing <i>Clivia</i> taxonomy using the core DNA barcode regions, <i>matK</i> and <i>rbcLa</i>

Thanks!

Thomas David Baker · Accepted Answer

The telltale lt and gt in your output tell me that the string you have is actually more like:

"Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy using the core DNA barcode regions, &lt;i&gt;matK&lt;/i&gt; and &lt;i&gt;rbcLa&lt;/i&gt;"

when viewed as plain text.

The string you show above is what would show in a browser, which interprets '&lt;' as '<' and '&gt;' as '>'. (These are usually called "HTML entities" and offer a way to encode a character that would otherwise be interpreted as HTML.)
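To make the diagnosis concrete, here is a minimal sketch of how the stray "lt i gt" can arise. It assumes the stored title contains entity-encoded tags; the variable names are only for illustration, and the filter_var call from the question is left out because the regex alone is enough to show the effect.

```php
<?php
// Assumed stored value: the tags are entity-encoded, not real markup.
$stored = 'Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy';

// The regex from the question replaces '&', ';' and '/' with spaces,
// so only the letters of each entity ("lt", "gt") and the tag name ("i") survive.
$mangled = preg_replace('/[^A-Za-z0-9\-]/', ' ', $stored);

// Decoding first turns the entities back into real '<' and '>' characters.
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($mangled); // contains "lt i gt Clivia"
var_dump($decoded); // "Assessing <i>Clivia</i> taxonomy"
```

In other words, the entities have to be decoded before any character-level cleanup runs, otherwise their letters are left behind as ordinary words.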

One option is to process like this:

$s = "Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy …";
$s = html_entity_decode($s); // $s is now "Assessing <i>Clivia</i> taxonomy …"
$s = strip_tags($s); // $s is now "Assessing Clivia taxonomy …"

But do be aware that strip_tags is an exceedingly naïve function. For example, it would turn '1<5 and 6>2' into '12'! So you need to be sure that all of your input text is double-HTML encoded, as the example is, for this to work reliably.
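If some titles might contain a literal '<' or '>', one possible alternative is to remove only a known list of tag names rather than calling strip_tags. The following is a sketch under that assumption, not a general HTML sanitizer: stripKnownTags, formatTitle and the default tag list are hypothetical names and values chosen for illustration.

```php
<?php
/**
 * Remove only the listed tags (opening, closing, or with attributes).
 * Hypothetical helper; the default tag list is an assumption.
 */
function stripKnownTags(string $text, array $tags = ['i', 'b', 'em', 'strong', 'sub', 'sup']): string
{
    $names = implode('|', array_map(
        static function (string $tag): string {
            return preg_quote($tag, '~');
        },
        $tags
    ));

    // Matches e.g. <i>, </i> or <i class="x">, but not "<5 and 6>".
    $result = preg_replace('~</?(?:' . $names . ')(?:\s[^>]*)?>~i', '', $text);

    // preg_replace returns null on error; fall back to the input in that case.
    return $result ?? $text;
}

/**
 * Decode entities first, then drop the known tags.
 */
function formatTitle(string $raw): string
{
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return stripKnownTags($decoded);
}

echo formatTitle('Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy'), "\n";
echo formatTitle('1&lt;5 and 6&gt;2'), "\n";
```

The trade-off is that any tag not on the list is left in place, so this only suits input where the set of possible tags is small and known in advance. The result can still be passed through the preg_replace from the question if only letters, digits and hyphens should remain.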

PHP - Remove decoded HTML entities from string

Tags:

string

replace

html-entities

liamjnorman
