Here's a quick Perl question:
How can I convert HTML special characters like ü
or '
to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser
. I just need to convert the text with the special chars I'm getting.
HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as < and > for HTTP transmission.
Wikipedia has a good expalanation of character encodings and how some characters should be represented in HTML. Load the HTML data to decode from a file, then press the 'Decode' button: Browse: Alternatively, type or paste in the text you want to HTML–decode, then press the 'Decode' button.
You have to use HTML character entities < and > in place of the < and > symbols so they aren't interpreted as HTML tags.
Take a look at HTML::Entities:
use HTML::Entities; my $html = "Snoopy & Charlie Brown"; print decode_entities($html), "\n";
You can guess the output.
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want and you don't want all the unicode characters you can look at the Text::Unidecode module from CPAN to Zap all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode); use HTML::Entities qw(decode_entities); my $source = '北亰'; print unidecode(decode_entities($source)); # That prints: Bei Jing
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With