Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I searched, but most of the samples I found are in C#.
Thanks
Is there any way to determine a string's encoding in C#? Say I have a filename string, but I don't know whether it is encoded in UTF-16 or the system-default encoding. How do I find out?
You cannot "encode" in Unicode, and there is no way to automagically determine the encoding of any given string without other prior information.
If you have a string, it was already encoded by someone along the way who knew or guessed the encoding to get the string in the first place. Falls back to the local default codepage if no Unicode encoding was found. Searches for charset=xyz and encoding=xyz inside the file to help determine the encoding.
Similarly, the IMultiLang2 interface has a function to detect the encoding of an incoming byte array. This is very handy for codepage detection of text stored in files or for text that needs to be sent over the internet. The EncodingTools class offers some easy-to-use functions to determine the best encoding for different scenarios.
I have written a small C++ library for detecting text file encoding. It uses Qt, but it can be just as easily implemented using just the standard library.
It operates by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it detects not only the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for a language before that language can be detected properly.
https://github.com/VioletGiraffe/text-encoding-detector
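The general idea of comparing occurrence statistics against reference values can be sketched as follows. This is only an illustration of the technique, not the code or the algorithm of the library linked above: the names `Histogram`, `Reference`, `byte_frequencies`, `squared_distance` and `best_match` are hypothetical, it counts raw bytes rather than decoded symbols, and real reference tables would have to be computed from a large corpus for each encoding and language.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Relative frequency of each of the 256 possible byte values.
using Histogram = std::array<double, 256>;

// Hypothetical reference entry: a label such as "CP1251/Russian"
// plus statistics that were computed ahead of time.
struct Reference {
    std::string label;
    Histogram freq;
};

// Count how often each byte value occurs and normalise by the length.
Histogram byte_frequencies(const unsigned char* data, std::size_t len) {
    Histogram h{};
    if (len == 0) return h;
    for (std::size_t i = 0; i < len; ++i) h[data[i]] += 1.0;
    for (double& v : h) v /= static_cast<double>(len);
    return h;
}

// Sum of squared differences; a smaller value means more similar.
double squared_distance(const Histogram& a, const Histogram& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Label of the closest reference, or an empty string if there are none.
std::string best_match(const Histogram& sample,
                       const std::vector<Reference>& refs) {
    std::string best;
    double best_d = 0.0;
    bool first = true;
    for (const Reference& r : refs) {
        const double d = squared_distance(sample, r.freq);
        if (first || d < best_d) {
            best = r.label;
            best_d = d;
            first = false;
        }
    }
    return best;
}
```

With short inputs the histogram is noisy, so a guess of this kind is only as good as the amount of text and the quality of the reference tables.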
It's not an easy problem to solve, and generally relies on heuristics to take a best guess at what the input encoding is, which can be tripped up by relatively innocuous inputs - for example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.
If you're looking for a Windows-only solution with minimal dependencies, you can use a combination of IsTextUnicode and MLang's DetectInputCodepage to attempt character set detection.
If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.
Assuming you know the length of the input array, you can make the following guesses:
If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
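The high-byte check can be sketched in a few lines of C++. This is a minimal illustration, not a complete detector: `Guess`, `looks_like_utf8` and `guess_encoding` are hypothetical names, and the sketch goes slightly beyond the answer by checking that the bytes are structurally valid UTF-8 instead of simply assuming UTF-8. It does not look at byte order marks or embedded zero bytes, so it says nothing about UTF-16 or UTF-32.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch.
enum class Guess { Ascii, Utf8, Unknown };

// True if every lead byte is followed by the right number of
// 10xxxxxx continuation bytes. Simplification: overlong 3- and
// 4-byte forms and encoded surrogates are not rejected.
static bool looks_like_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t extra;
        if (b < 0x80)                    extra = 0;  // plain ASCII byte
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;  // stray continuation byte or invalid lead
        if (extra > n - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// No byte in 0x80..0xff: treat as ASCII. Otherwise UTF-8 if the
// bytes are well formed, else some encoding this sketch cannot name.
Guess guess_encoding(const unsigned char* p, std::size_t n) {
    bool has_high_byte = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= 0x80) { has_high_byte = true; break; }
    }
    if (!has_high_byte) return Guess::Ascii;
    return looks_like_utf8(p, n) ? Guess::Utf8 : Guess::Unknown;
}
```

A buffer that comes back as `Guess::Unknown` is where the real guessing starts, for example with the statistical or library-based approaches mentioned in the other answers.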