How can I decode HTML entities?

Here's a quick Perl question:

How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?

I started with something like this:

s/\&#(\d+);/chr($1)/eg; 

and could write it for all HTML characters, but some function like this probably already exists?

Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.

asked Feb 22 '09 23:02 by Frank


2 Answers

Take a look at HTML::Entities:

use HTML::Entities;

my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";

You can guess the output: Snoopy & Charlie Brown
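
For comparison with the substitution in the question, which only handles decimal references like &#39;, here is a minimal sketch showing that decode_entities also covers named and hexadecimal entities. The sample strings are made up for illustration, and the binmode call assumes your terminal expects UTF-8.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Decoded text can contain non-ASCII characters such as U+00FC,
# so give STDOUT an encoding layer (assumes a UTF-8 terminal).
binmode STDOUT, ':encoding(UTF-8)';

# Illustrative inputs: a named entity, a decimal reference,
# a hexadecimal reference, and a mix of markup-related entities.
my @samples = (
    'M&uuml;nchen',
    'It&#39;s',
    'It&#x27;s',
    '&lt;b&gt; &amp; &quot;quotes&quot;',
);

for my $sample (@samples) {
    # In non-void context decode_entities returns a decoded copy
    # and leaves its argument unchanged.
    my $decoded = decode_entities($sample);
    print "$sample => $decoded\n";
}
```

Called in void context, decode_entities modifies its argument in place instead, so assign the result if you want to keep the original string.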

answered Sep 21 '22 22:09 by Telemachus


The answer above tells you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.

Assuming that this is really what you want and you don't need to keep the Unicode characters, take a look at the Text::Unidecode module from CPAN. It zaps all those odd characters back into a roughly similar collection of ASCII characters:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);

my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));

# That prints: Bei Jing
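
The same two calls can be applied to the kind of input the question mentions. The following is a rough sketch; the helper name entities_to_ascii and the sample string are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Text::Unidecode qw(unidecode);

# Hypothetical helper: decode the entities first, then transliterate
# whatever non-ASCII characters remain into plain ASCII.
sub entities_to_ascii {
    my ($text) = @_;
    return unidecode(decode_entities($text));
}

# The result is pure ASCII, so no output encoding layer is needed.
print entities_to_ascii('M&uuml;nchen &amp; K&ouml;ln'), "\n";
# prints: Munchen & Koln
```

Keep in mind that the transliteration is lossy: the umlauts are simply dropped rather than expanded to "ue" or "oe", so this is only suitable when an approximate ASCII rendering is acceptable.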

answered Sep 25 '22 22:09 by Mark Fowler