Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

list of garbage characters like ’

I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with ’. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.

thanx

like image 579
ZafarYousafi Avatar asked Aug 19 '12 03:08

ZafarYousafi


People also ask

What are the junk characters?

i.e., any character having an ascii equivalent decimal value of more than 127 is a junk character(courtesy www.asciitable.com).

What does â € œ mean?

Save this answer. Show activity on this post. “ is "Mojibake" for “ . You could try to avoid the non-ascii quotes, but that would only delay getting back into trouble.

What is this character â?

Â, â (a-circumflex) is a letter of the Inari Sami, Skolt Sami, Romanian, and Vietnamese alphabets. This letter also appears in French, Friulian, Frisian, Portuguese, Turkish, Walloon, and Welsh languages as a variant of the letter "a".

How do I find junk characters in a text file?

You can download Notepad++ and open the file there. Then, go to the menu and select View->Show Symbol->Show All Characters . All characters will become visible, but you will have to scroll through the whole file to see which character needs to be removed.


1 Answers

Search for the term "UTF-8", because that's what you're seeing.

UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.

In your example, you started with the smart quote character . Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)

I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.

If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.

like image 131
librik Avatar answered Sep 23 '22 06:09

librik