I'm interested in unescaping text in C; for example, &#92; maps to \. Does anyone know of a good library?
For reference, see Wikipedia's List of XML and HTML character entity references.
Wikipedia has a good explanation of character encodings and how some characters should be represented in HTML.
HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.
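To make the encoding direction concrete, here is a minimal sketch in C. The function name html_encode and its buffer-based signature are made up for illustration and do not come from any particular library; it only handles the four characters <, >, & and ".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from any library): writes an HTML-encoded copy
 * of `in` to `out`. Returns the encoded length, excluding the terminator,
 * or (size_t)-1 if `out` (capacity `cap` bytes) is too small. */
size_t html_encode(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (; *in != '\0'; in++) {
        char single[2] = { *in, '\0' };   /* default: copy the byte as-is */
        const char *rep;
        switch (*in) {
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '&': rep = "&amp;";  break;
        case '"': rep = "&quot;"; break;
        default:  rep = single;   break;
        }
        size_t len = strlen(rep);
        if (n + len + 1 > cap)            /* keep room for the terminator */
            return (size_t)-1;
        memcpy(out + n, rep, len);
        n += len;
    }
    if (n + 1 > cap)                      /* only reachable for empty input */
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Calling html_encode("a < b", buf, sizeof buf) with a large enough buffer would leave "a &lt; b" in buf. Returning an error instead of truncating avoids emitting half of an entity.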
HtmlDecode(String, TextWriter): Converts a string that has been HTML-encoded into a decoded string, and sends the decoded string to a TextWriter output stream.
& is HTML for "Start of a character reference". &amp; is the character reference for "An ampersand". &current; is not a standard character reference and so is an error (browsers may try to perform error recovery but you should not depend on this).
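A rough sketch of a decoder in C that follows this rule might look like the code below. Everything here is an assumption for illustration: the names html_decode and decode_one are hypothetical, the entity table holds only five names (a real decoder needs the full HTML list), numeric references are accepted only in the ASCII range, and anything unrecognised is copied through unchanged rather than attempting browser-style error recovery.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative table; the full HTML list has over two thousand names. */
static const struct { const char *name; char ch; } entities[] = {
    { "amp", '&' }, { "lt", '<' }, { "gt", '>' },
    { "quot", '"' }, { "apos", '\'' },
};

/* Hypothetical helper: tries to decode one reference starting at s[0] == '&'.
 * On success stores the character in *ch and returns the number of input
 * bytes consumed; returns 0 if the reference is not recognised. */
static size_t decode_one(const char *s, char *ch)
{
    const char *semi = strchr(s, ';');
    if (semi == NULL)
        return 0;
    size_t body = (size_t)(semi - s) - 1;    /* bytes between '&' and ';' */

    if (s[1] == '#') {                       /* numeric: &#92; or &#x5C; */
        int hex = (s[2] == 'x' || s[2] == 'X');
        const char *digits = s + (hex ? 3 : 2);
        const char *p = digits;
        unsigned long cp = 0;
        for (; p < semi; p++) {
            int d;
            if (*p >= '0' && *p <= '9')             d = *p - '0';
            else if (hex && *p >= 'a' && *p <= 'f') d = *p - 'a' + 10;
            else if (hex && *p >= 'A' && *p <= 'F') d = *p - 'A' + 10;
            else return 0;                   /* not a digit: malformed */
            cp = cp * (hex ? 16UL : 10UL) + (unsigned long)d;
            if (cp > 127)
                return 0;                    /* non-ASCII: out of scope here */
        }
        if (p == digits || cp == 0)
            return 0;                        /* no digits, or NUL */
        *ch = (char)cp;
        return (size_t)(semi - s) + 1;
    }

    for (size_t i = 0; i < sizeof entities / sizeof entities[0]; i++) {
        if (strlen(entities[i].name) == body &&
            memcmp(s + 1, entities[i].name, body) == 0) {
            *ch = entities[i].ch;
            return body + 2;                 /* '&' + name + ';' */
        }
    }
    return 0;
}

/* Decodes `in` into `out`, which must hold at least strlen(in) + 1 bytes
 * (with this table the decoded text is never longer than the input). */
void html_decode(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        char ch;
        size_t used = (*in == '&') ? decode_one(in, &ch) : 0;
        if (used > 0) {
            *out++ = ch;
            in += used;
        } else {
            *out++ = *in++;                  /* copy unrecognised text as-is */
        }
    }
    *out = '\0';
}
```

With this sketch, "&amp;" becomes "&" and "&#92;" becomes "\", while "&current;" is left untouched because it is not in the table. Decoding is single-pass, so "&amp;amp;" yields "&amp;" rather than "&".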
For another open-source reference in C for decoding these HTML entities, you can check out the command-line utility uni2ascii/ascii2uni. The relevant files are enttbl.{c,h} for entity lookup and putu8.c, which down-converts from UTF-32 to UTF-8.
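For readers who just want to see what the UTF-32 to UTF-8 step involves, here is a minimal sketch. It is not the code from putu8.c; the function name utf8_encode and its signature are assumptions for illustration, and it follows the standard UTF-8 bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the actual code in putu8.c: encodes one Unicode
 * code point as UTF-8 into buf (at least 4 bytes). Returns the number of
 * bytes written, or 0 for surrogates and values past U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {                         /* 1 byte: 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)        /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                      /* 3 bytes: 1110xxxx 10.. 10.. */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 4 bytes: 11110xxx 10.. 10.. 10.. */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A decoder for numeric references such as &#233; could parse the number into a uint32_t and pass it to a function like this to produce the output bytes.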
uni2ascii