I have an ANSI encoded text file that should not have been encoded as ANSI as there were accented characters that ANSI does not support. I would rather work with UTF-8.
Can the data be decoded correctly or is it lost in transcoding?
What tools could I use?
Here is a sample of what I have:
ç é
I can tell from context (café should be café) that these should be these two characters:
ç é
Right-click at somewhere on web page to manually set character encoding. The selected character set will automatically apply to all pages on the same site. Select "Use page default" to cancel it.
Go to "File" -> "Options" -> "Advanced" and scroll down until the "General" section is reached. In the "General" section, check the box that says "Confirm file format conversion on open." Exit Word, and reopen the corrupt document again. The dialogue box will appear.
Click Tools, then select Web options. Go to the Encoding tab. In the dropdown for Save this document as: choose Unicode (UTF-8). Click Ok.
Follow these steps with Notepad++
1- Copy the original text
2- In Notepad++, open new file, change Encoding -> pick an encoding you think the original text follows. Try as well the encoding "ANSI" as sometimes Unicode files are read as ANSI by certain programs
3- Paste
4- Then to convert to Unicode by going again over the same menu: Encoding -> "Encode in UTF-8" (Not "Convert to UTF-8") and hopefully it will become readable
The above steps apply for most languages. You just need to guess the original encoding before pasting in notepad++, then convert through the same menu to an alternate Unicode-based encoding to see if things become readable.
Most languages exist in 2 forms of encoding: 1- The old legacy ANSI (ASCII) form, only 8 bits, was used initially by most computers. 8 bits only allowed 256 possibilities, 128 of them where the regular latin and control characters, the final 128 bits were read differently depending on the PC language settings 2- The new Unicode standard (up to 32 bit) give a unique code for each character in all currently known languages and plenty more to come. if a file is unicode it should be understood on any PC with the language's font installed. Note that even UTF-8 goes up to 32 bit and is just as broad as UTF-16 and UTF-32 only it tries to stay 8 bits with latin characters just to save up disk space
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With