Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I searched, but most of the samples I found are in C#. Thanks.


Detect encoding of a string in C/C++

3 Answers

I have written a small C++ library for detecting text file encoding. It uses Qt, but it can be just as easily implemented using just the standard library.

It operates by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it detects not only the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for each target language for that language to be detected properly.
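To illustrate the general idea of comparing occurrence statistics against reference profiles, here is a minimal sketch using only the standard library. It is not the linked library's implementation: it counts single bytes rather than symbols, uses a sum of squared differences as the distance, and the names `Profile`, `byteFrequencies`, `squaredDistance`, and `closestProfileName` are hypothetical. Reference profiles are assumed to be built in advance from sample text saved in each candidate encoding.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

typedef std::array<double, 256> Histogram;

// Relative frequency of each byte value in `bytes` (all zeros for empty input).
Histogram byteFrequencies(const std::string& bytes)
{
    Histogram h{};  // value-initialized: every bin starts at 0.0
    if (bytes.empty())
        return h;
    for (unsigned char c : bytes)
        h[c] += 1.0;
    for (double& f : h)
        f /= static_cast<double>(bytes.size());
    return h;
}

// Hypothetical reference profile: a label plus pre-computed byte statistics.
struct Profile {
    std::string name;
    Histogram reference;
};

// Sum of squared differences between two histograms (smaller means more alike).
double squaredDistance(const Histogram& a, const Histogram& b)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Returns the name of the profile closest to the input,
// or an empty string if no profiles were supplied.
std::string closestProfileName(const std::string& bytes,
                               const std::vector<Profile>& profiles)
{
    const Histogram observed = byteFrequencies(bytes);
    std::string best;
    double bestDistance = 0.0;
    for (std::size_t i = 0; i < profiles.size(); ++i) {
        const double d = squaredDistance(observed, profiles[i].reference);
        if (i == 0 || d < bestDistance) {
            bestDistance = d;
            best = profiles[i].name;
        }
    }
    return best;
}
```

In use, one would build one `Profile` per (language, encoding) pair by calling `byteFrequencies` on representative sample text, then pass the unknown buffer to `closestProfileName`. Short inputs give noisy histograms, so the result should be treated as a guess.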

https://github.com/VioletGiraffe/text-encoding-detector


answered Sep 29 '22 08:09

Violet Giraffe

It's not an easy problem to solve. Solutions generally rely on heuristics to make a best guess at the input encoding, and those heuristics can be tripped up by relatively innocuous inputs. For example, take a look at this Wikipedia article and "The Notepad file encoding Redux" for more details.

If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.

If you are looking for portability but don't mind taking on a fairly large dependency in the form of ICU, then you can use its character set detection routines to achieve the same thing in a portable manner.

answered Sep 29 '22 07:09

russw_uk

Assuming you know the length of the input array, you can make the following guesses:

1. First, check whether the first few bytes match any well-known Unicode byte order mark (BOM). If they do, you're done!
2. Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
3. If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
4. At this point it is either ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen not to use the top bit and do not contain any null characters.
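The steps above can be sketched roughly as follows, using only the standard library. The function names and the returned labels are hypothetical, and the results are guesses rather than proofs. One addition not in the answer: instead of simply assuming UTF-8 when high bytes are present, the sketch runs a structural check (`looksLikeUtf8`) first.

```cpp
#include <cstddef>
#include <string>

// Structural UTF-8 check (an addition to the steps above): every lead byte
// must be followed by the right number of 10xxxxxx continuation bytes.
// It does not reject every ill-formed sequence, such as some overlong
// forms or encoded surrogates.
bool looksLikeUtf8(const unsigned char* data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b = data[i];
        std::size_t extra;
        if (b < 0x80)                    extra = 0;  // 7-bit byte
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;                           // invalid lead byte
        if (extra > len - i - 1)
            return false;                            // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((data[i + k] & 0xC0) != 0x80)
                return false;                        // not a continuation byte
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: returns a best-guess label for the bytes in `data`.
std::string guessEncoding(const unsigned char* data, std::size_t len)
{
    // Step 1: byte order marks. UTF-32LE is tested before UTF-16LE because
    // FF FE 00 00 also begins with the UTF-16LE mark FF FE (it could in
    // principle be UTF-16LE text starting with U+0000).
    if (len >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";
    if (len >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";

    // Steps 2 and 3: one pass to look for nulls (ignoring the last byte,
    // which may be a terminator) and for bytes with the top bit set.
    bool sawNull = false, sawConsecutiveNulls = false, sawHighBit = false;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] >= 0x80)
            sawHighBit = true;
        if (data[i] == 0 && i + 1 < len) {
            sawNull = true;
            if (i > 0 && data[i - 1] == 0)
                sawConsecutiveNulls = true;
        }
    }
    if (sawConsecutiveNulls) return "UTF-32 (guess)";
    if (sawNull)             return "UTF-16 or UTF-32 (guess)";
    if (sawHighBit)
        return looksLikeUtf8(data, len) ? "UTF-8 (guess)"
                                        : "unknown 8-bit or multi-byte";

    // Step 4: only 7-bit, non-null bytes remain.
    return "7-bit (ASCII, UTF-7, Base64, ...)";
}
```

Calling `guessEncoding(reinterpret_cast<const unsigned char*>(ptr), length)` on a `char` buffer returns one of the labels above. Text in a legacy code page can occasionally pass the UTF-8 check by coincidence, so the label is a starting point for further checks, not a final answer.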

answered Sep 29 '22 07:09

MSN


Tags: windows, character-encoding, visual-c++

jAckOdE
