I'm parsing an HTML document with a couple of Perl modules: HTML::TreeBuilder and HTML::Element. For some reason, whenever the content of a tag is just &nbsp;, which is to be expected, HTML::Element returns it as a strange character I've never seen before:
(Screenshot: http://www.freeimagehosting.net/uploads/2acca201ab.jpg)
I can't copy the character, so I can't Google it. I couldn't find it in Character Map, and strangely, when I search with a regular expression, \w finds it. When I convert the returned document to ANSI or UTF-8, it disappears altogether. I couldn't find any information on it in the HTML::Element documentation either.
How can I detect and replace this character with something more useful like null, and how should I deal with strange characters like this in the future?
The character is "\xa0" (i.e. 160), which is the standard Unicode translation for &nbsp;. (That is, it's Unicode's non-breaking space.) You should be able to replace them with ordinary spaces using s/\xa0/ /g if you like.
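As a minimal sketch of that substitution, assuming the text has already been decoded into characters (not raw UTF-8 bytes): the helper name replace_nbsp is hypothetical and not part of HTML::Element, and the sample string simply stands in for what the parser returns.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Element): replace every
# non-breaking space (U+00A0) with an ordinary space.
sub replace_nbsp {
    my ($text) = @_;
    $text =~ s/\xa0/ /g;
    return $text;
}

# "\xa0" stands in for what a decoded &nbsp; looks like in the parsed text.
my $cell  = "total:\xa042";
my $clean = replace_nbsp($cell);
print "$clean\n";    # prints "total: 42"
```

With HTML::Element, the same substitution could be applied to the string returned by $element->as_text. If the data has already been encoded to UTF-8 bytes, the non-breaking space is the two-byte sequence C2 A0, so decoding first is the safer order.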
The character is a non-breaking space, which is what &nbsp; stands for:
In word processing and digital typesetting, a non-breaking space (" ") (also called no-break space, non-breakable space (NBSP), hard space, or fixed space) is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.
In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as &amp;nbsp; or &amp;#160;. In Unicode, it is encoded as U+00A0.
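For strange characters in the future, one approach is to print the code point of anything outside ASCII and look that up. The following is a rough sketch only; the helper name non_ascii_code_points is made up for illustration, and it assumes the input is a decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code point (as "U+XXXX") of every
# non-ASCII character found in the text, in order of appearance.
sub non_ascii_code_points {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } $text =~ /([^\x00-\x7F])/g;
}

my @found = non_ascii_code_points("price:\xa0100");
print "@found\n";    # prints "U+00A0"
```

Searching for the printed value (here U+00A0) identifies the character even when it cannot be copied or seen.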