
What codepage encodes a 'ç' as '?º' (0x3f 0xba)

Today I received a file from a customer that I have to read, but it contains strange characters. Using known names, I can guess the meaning of some characters.

For example:

Real name | Encoded as   | Char  | Hex
----------|--------------|-------|-------
Françios  | Fran?ºios    | ç     | 3f ba
André     | Andr??       | é     | 3f 3f
Hélène    | H??l?¿ne     | è     | 3f bf
etc.
  • I have tried importing the file with every codepage known to .NET and checked whether the result contains the words I know, but no codepage produces the expected names.
  • Notepad++ detects the file as ANSI and shows the same unwanted characters. (It does have a useful hex-editor plugin, though.)
  • Other files (from the same user and zip file) are encoded in UTF-8.

I cannot expect help from the person who sent me the files. He made it clear (via Google Translate) that he found it very hard just to create the files, and he is using software (I believe SAP) that I do not have access to.

Is there any other way I can find the encoding of the files he just sent me?

GvS asked Mar 11 '11 14:03


2 Answers

I can get those results if I take UTF-8 encoded text, pretend it is CP850, and then convert it to Latin-1, Windows-1252, or a similar encoding. For example, "ç" is 0xc3 0xa7 in UTF-8, and CP850 reads 0xa7 as "º", which Latin-1 encodes as 0xba. The "?" comes from the fact that the CP850 character at 0xc3 is "├", which doesn't exist in Latin-1 or derived encodings, so the conversion replaces it with a "?".
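A minimal Python sketch of that round trip, assuming Python's built-in `cp850` and `latin-1` codecs behave like whatever the customer's software actually did. The `mangle` helper is a hypothetical name used only for illustration.

```python
def mangle(text: str, wrong_codepage: str = "cp850") -> bytes:
    """Reproduce the suspected corruption for a piece of text."""
    utf8_bytes = text.encode("utf-8")            # the data really is UTF-8
    misread = utf8_bytes.decode(wrong_codepage)  # but it is read as a DOS codepage
    # Characters with no Latin-1 equivalent (such as "├") become "?".
    return misread.encode("latin-1", errors="replace")


print(mangle("ç"))  # b'?\xba'
```

The result is the same 0x3f 0xba pair that appears in the question's table for "ç".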


Edit: I did a somewhat wider search using iconv, and CP437, CP862, or CP865 are better matches than CP850. Since you asked, the one-liner I used this time was:

for enc in $(iconv -l); do printf '%s: ' "$enc"; printf 'ç é è' | iconv -s -f "$enc" -t 'LATIN1//TRANSLIT' 2>/dev/null; echo; done
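
For readers without iconv, roughly the same search can be sketched in Python. It applies the round trip described in this answer (UTF-8 bytes misread as a candidate codepage, then written as Latin-1 with "?" substitution) and checks each candidate against the three byte pairs from the question's table. The candidate shortlist and the helper names are assumptions made for the example, and Python's `errors="replace"` only approximates iconv's `//TRANSLIT`.

```python
# Byte pairs observed in the customer's file (taken from the question's table).
OBSERVED = {
    "ç": bytes.fromhex("3fba"),
    "é": bytes.fromhex("3f3f"),
    "è": bytes.fromhex("3fbf"),
}

# Illustrative shortlist of DOS codepages; extend as needed.
CANDIDATES = ["cp437", "cp850", "cp852", "cp858", "cp862", "cp865", "cp866"]


def round_trip(char: str, codepage: str) -> bytes:
    """UTF-8 bytes misread as `codepage`, then written out as Latin-1."""
    misread = char.encode("utf-8").decode(codepage, errors="replace")
    return misread.encode("latin-1", errors="replace")


def matching_codepages(candidates=CANDIDATES, observed=OBSERVED):
    """Return the candidates that reproduce every observed byte pair."""
    return [
        cp for cp in candidates
        if all(round_trip(ch, cp) == raw for ch, raw in observed.items())
    ]


print(matching_codepages())
```

CP850 drops out of the result because it maps 0xa9 to "®", which Latin-1 can represent, so "é" would come out as 3f ae rather than the observed 3f 3f.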

Anomie answered Oct 06 '22 14:10


It should be UTF-8 or UTF-16; between them they cover almost all regular characters. It looks like you have a decode/encode problem.

Notepad++ may be confused because your files do not use a byte-order mark (BOM).

How do you process your files?

Try reading them as binary and then decoding the bytes with different encodings to get a string. If you do not read them as binary, a default encoding may be applied.

The "?" is a sign of that.
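
As a rough illustration of that advice, the sketch below reads a file as raw bytes and attempts a few decodings explicitly instead of relying on a platform default. The shortlist of encodings and the `try_decodings` and `inspect_file` helpers are assumptions made for the example, and it is written in Python rather than the .NET the asker mentions.

```python
from pathlib import Path


def try_decodings(raw: bytes, encodings=("utf-8", "utf-16", "cp1252", "cp437")):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)  # strict mode: invalid input raises
        except UnicodeDecodeError:
            results[name] = None
    return results


def inspect_file(path: str) -> None:
    """Print every attempted decoding of the file at `path` (hypothetical helper)."""
    raw = Path(path).read_bytes()  # binary read: no implicit decoding happens
    for name, text in try_decodings(raw).items():
        print(f"{name:8} {text!r}")
```

If the raw bytes already contain a literal 0x3f where an accented letter should be, as in the question's hex column, the substitution happened before the file was written, and no choice of decoding can recover the lost byte.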

Maybe that helps.

mo. answered Oct 06 '22 14:10