Why is this A0 character appearing in my HTML::Element output?

Tags:

encoding

perl

I'm parsing an HTML document with a couple of Perl modules: HTML::TreeBuilder and HTML::Element. For some reason, whenever the content of a tag is just &nbsp;, which is to be expected, HTML::Element returns it as a strange character I've never seen before:

(Screenshot of the character: http://www.freeimagehosting.net/uploads/2acca201ab.jpg)

I can't copy the character, so I can't Google it. I couldn't find it in Character Map, and strangely, when I search with a regular expression, \w matches it. When I convert the returned document to ANSI or UTF-8, it disappears altogether. I couldn't find any information on it in the HTML::Element documentation either.

How can I detect this character and replace it with something more useful, like null, and how should I deal with strange characters like this in the future?

RobbR asked Sep 19 '09 17:09



2 Answers

The character is "\xa0" (decimal 160, U+00A0), which is what &nbsp; decodes to: Unicode's non-breaking space. You can replace each occurrence with an ordinary space using s/\xa0/ /g, or delete it outright with s/\xa0//g, if you like.
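A minimal sketch of both detecting and replacing the character, assuming the text has already been pulled out of the tree as an ordinary Perl string. The `clean_nbsp` helper and the sample `$text` value are made up for illustration and are not part of HTML::Element:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a string contains U+00A0 and
# return a copy with every occurrence replaced by an ordinary space.
sub clean_nbsp {
    my ($text) = @_;
    my $found = ($text =~ /\xa0/) ? 1 : 0;    # detect
    (my $clean = $text) =~ s/\xa0/ /g;        # replace
    return ($found, $clean);
}

# Stand-in for the text of an element whose only content is &nbsp;
my $text = "\xa0";
my ($found, $clean) = clean_nbsp($text);

printf "contains NBSP: %s, cleaned: '%s'\n", ($found ? 'yes' : 'no'), $clean;
```

To drop the character instead of substituting a space, `tr/\xa0//d` on the string should work just as well as the `s///` form.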

chaos answered Sep 23 '22 18:09


The character is a non-breaking space, which is what &nbsp; stands for:

In word processing and digital typesetting, a non-breaking space (" ") (also called no-break space, non-breakable space (NBSP), hard space, or fixed space) is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.

In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as &nbsp; or &#160;. In Unicode, it is encoded as U+00A0.
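To make the equivalence concrete, here is a small sketch that uses only core Perl. The `decode_numeric_refs` helper is hypothetical and handles decimal numeric references only; it is a simplified stand-in, not a replacement for a real entity decoder:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn decimal references such as &#160; into characters.
sub decode_numeric_refs {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr($1)/ge;
    return $html;
}

my $char = decode_numeric_refs('&#160;');

# Code point of the decoded character, in U+XXXX notation
my $code_point = sprintf 'U+%04X', ord $char;

# Bytes produced when that character is serialized as UTF-8
my $utf8_hex = join ' ',
    map { sprintf '%02X', ord } split //, encode('UTF-8', $char);

print "$code_point encodes to UTF-8 bytes $utf8_hex\n";
```

This prints `U+00A0 encodes to UTF-8 bytes C2 A0`: a single character in Perl's internal string, but two bytes once written out as UTF-8.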

Sinan Ünür answered Sep 20 '22 18:09