

Why can't I get rid of this Â ?

Q: What is this â € œ?

â€œ is "Mojibake" for “. You could try to avoid the non-ASCII quotes, but that would only delay getting back into trouble.

Tags:

php

encoding

Each line is a string

Â&nbsp;4 
Â&nbsp;minutes 
Â&nbsp;12
Â&nbsp;minutes
Â&nbsp;16
Â&nbsp;minutes

I was able to remove the Â successfully using str_replace but not the HTML entity. I found this question: How to remove html special chars?

But the preg_replace did not do the job. How can I remove the HTML entity and that A?

Edit: I think I should have said this earlier: I am using DOMDocument::loadHTML() and DOMXpath. Edit: Since this seems like an encoding issue, I should say that this is actually all separate strings.


asked Aug 30 '10 00:08

Strawberry

1 Answer

Alright - I think I've got a handle on this now - I want to expand on some of the encoding errors that people are getting at:

This seems to be an advanced case of Mojibake, but here is what I think is going on. MikeAinOz's original suspicion that this is UTF-8 data is probably true. If we take the following UTF-8 data:

4&nbsp;minutes

Now, remove the HTML entity, and replace it with the character it actually corresponds to: U+00A0. (It's a non-breaking space, so I can't exactly "show" you.) You get the string: "4 minutes". Encode this as UTF-8, and you get the following byte sequence:

characters:  4  [nbsp]   m   i   n ...
bytes     : 34  C2  A0  6D  69  6E ...

(I'm using [nbsp] above to mean a literal non-breaking space: the character itself, not the HTML entity &nbsp; that represents it. It's just white-space, and thus difficult to show.) Note that the [nbsp]/U+00A0 (non-breaking space) takes 2 bytes to encode in UTF-8.
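If you want to check that byte sequence yourself, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escape), and `$text` and `$hex` are just illustrative names:

```php
<?php
// "4", U+00A0 (non-breaking space), "min" as a string literal.
// The \u{...} escape (PHP 7+) always produces UTF-8 bytes.
$text = "4\u{00A0}min";

// Dump the raw bytes as hex, two digits per byte, space separated.
$hex = strtoupper(implode(' ', str_split(bin2hex($text), 2)));

echo $hex, PHP_EOL; // 34 C2 A0 6D 69 6E
```

Four characters, six bytes: the extra bytes are the two-byte encoding of the non-breaking space.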

Now, to go from byte stream back to readable text, we should decode using UTF-8, since that's what we encoded in. Suppose we mistakenly use ISO-8859-1 ("latin1") instead - when the wrong encoding gets used, it is almost always this one.

bytes     : 34  C2      A0  6D  69  6E ...
characters:  4   Â  [nbsp]   m   i   n ...

Now switch the raw non-breaking space back into its HTML entity representation, and you get what you have.
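The same mistake can be reproduced on purpose. This is only a sketch of the mechanism, not your actual code path: it assumes the mbstring extension is loaded, and the variable names are made up for illustration.

```php
<?php
// The UTF-8 bytes for "4", U+00A0, "min".
$bytes = "4\xC2\xA0min";

// Wrongly treat every byte as one ISO-8859-1 character, then re-encode the
// result as UTF-8 so it can be shown on a UTF-8 page or terminal.
$misread = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// Swap the literal non-breaking space for its HTML entity.
$asSeen = str_replace("\u{00A0}", '&nbsp;', $misread);

echo $asSeen, PHP_EOL; // 4Â&nbsp;min
```

The byte C2 became a character of its own (Â), and A0 on its own is still a non-breaking space in ISO-8859-1, which is why the two always show up together.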

So, either your PHP stuff is interpreting your text in the wrong character set, and you need to tell it otherwise, or you are outputting the result somehow in the wrong character set. More code would be useful here -- where are you getting the data you're passing to this loadHTML, and how are you going about getting the output you're seeing?
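Without seeing your code I can only guess, but if the input really is UTF-8 and carries no charset declaration, one commonly used workaround is to prepend one before calling loadHTML(). A minimal sketch under that assumption (`$html` is a hypothetical stand-in for your real input, and the DOM extension must be available):

```php
<?php
// Hypothetical stand-in for the real input: UTF-8 HTML with no charset hint.
$html = "<p>4\u{00A0}minutes</p>";

$doc = new DOMDocument();
// Prepending a charset declaration tells the parser to read the bytes as UTF-8.
$doc->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
);

$xpath = new DOMXPath($doc);
$text  = $xpath->query('//p')->item(0)->textContent; // DOM returns UTF-8

// Decoded correctly, there is no stray "Â" left to remove; only the
// non-breaking space itself remains, and it can be replaced or stripped.
$clean = str_replace("\u{00A0}", ' ', $text);

echo $clean, PHP_EOL;
```

The point is to fix the decoding step rather than to str_replace the symptoms away afterwards.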

Some background: A "character encoding" is just a means of going from a series of characters to a series of bytes. What bytes represent "é"? UTF-8 says C3 A9, whereas ISO-8859-1 says E9. To get the original text back from a series of bytes, we must know what we encoded it with. If we decode C3 A9 as UTF-8 data, we get "é" back; if we (mistakenly) decode it as ISO-8859-1, we get "Ã©". Junk. In pseudo-code:

utf8-decode ( utf8-encode ( text-data ) )           // OK
iso8859_1-decode ( iso8859_1-encode ( text-data ) ) // OK
iso8859_1-decode ( utf8-encode ( text-data ) )      // Fails
utf8-decode ( iso8859_1-encode ( text-data ) )      // Fails

This isn't PHP code, and isn't your fix... it's just the crux of the problem. Somewhere in the larger pipeline, one of the two failing combinations is happening, and things are confused.
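For what it's worth, the four cases can be tried out in real PHP. This is a rough sketch, assuming PHP 7.4+ (arrow functions) and the mbstring extension; the helper `$latin1Decode` and the other names are invented for the example.

```php
<?php
// In this sketch "text" is a UTF-8 PHP string; "bytes" is what got stored or sent.
$text = "\u{00E9}"; // é

// Encode: characters -> bytes.
$utf8Bytes   = $text;                                             // C3 A9
$latin1Bytes = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8'); // E9

// Decoding ISO-8859-1 means mapping each byte to exactly one character.
$latin1Decode = fn(string $bytes): string =>
    mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

$utf8RoundTrip   = mb_check_encoding($utf8Bytes, 'UTF-8');   // OK: valid UTF-8
$latin1RoundTrip = $latin1Decode($latin1Bytes) === $text;    // OK: é again
$garbled         = $latin1Decode($utf8Bytes);                // Fails: "Ã©"
$invalid         = !mb_check_encoding($latin1Bytes, 'UTF-8'); // Fails: E9 alone is not UTF-8

var_dump($utf8RoundTrip, $latin1RoundTrip, $garbled, $invalid);
```

Note the two failures look different: UTF-8 bytes read as ISO-8859-1 silently produce junk characters, while ISO-8859-1 bytes read as UTF-8 are usually not valid at all.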

answered Oct 01 '22 13:10

Thanatos
