How can I decode HTML entities?

Q: How do you show entities in HTML?

You have to use HTML character entities &lt; and &gt; in place of the < and > symbols so they aren't interpreted as HTML tags.

Tags: html, ascii, perl, special-characters

Here's a quick Perl question:

How can I convert HTML special characters like &uuml; or &#039; to normal ASCII text?

I started with something like this:

s/\&#(\d+);/chr($1)/eg;

and could write it for all HTML characters, but some function like this probably already exists?

Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.

asked Feb 22 '09 23:02

Frank

2 Answers

Take a look at HTML::Entities:

use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";

You can guess the output: Snoopy & Charlie Brown
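To make the behaviour a little more concrete, here is a minimal sketch that feeds decode_entities a named, a decimal, and a hexadecimal reference at once, and also shows the in-place form. The sample strings and variable names are made up for illustration; only decode_entities itself comes from HTML::Entities.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Named, decimal and hexadecimal references in one (made-up) string.
my $raw = 'M&uuml;ller said &#039;hi&#039; &#x26; left';

# Called for its return value, decode_entities leaves $raw untouched
# and returns a decoded copy.
my $decoded = decode_entities($raw);

# Called in void context, it decodes its argument in place.
my $in_place = 'Snoopy &amp; Charlie Brown';
decode_entities($in_place);

# $decoded now holds a real U+00FC character, so declare an output encoding.
binmode(STDOUT, ':encoding(UTF-8)');
print "$decoded\n";
print "$in_place\n";
```

Note that the result is a Perl character string, not ASCII: the ü is still a ü. Also, since the text already comes from HTML::Parser, asking for dtext instead of text in the text handler's argspec should hand you the text with entities already decoded.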

answered Sep 21 '22 22:09

Telemachus

The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.

Assuming that this is really what you want, and you don't want to keep the Unicode characters, you can look at the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));  # That prints: Bei Jing
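For the Latin-1 style characters mentioned in the question, the same two calls can be wrapped in a small helper. This is only a sketch: html_to_ascii is a hypothetical name, not something either module provides, and the sample string is invented.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Text::Unidecode qw(unidecode);

# html_to_ascii is a hypothetical helper name, not part of either module.
# It first turns entities into characters, then transliterates to ASCII.
sub html_to_ascii {
    my ($text) = @_;
    return unidecode(decode_entities($text));
}

print html_to_ascii('M&uuml;ller &amp; S&ouml;hne &#039;GmbH&#039;'), "\n";
# prints: Muller & Sohne 'GmbH'
```

Keep in mind that the transliteration is lossy and language-agnostic: ü becomes a plain u rather than the ue a German reader might expect.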

answered Sep 25 '22 22:09

Mark Fowler
