How can I detect the encoding/codepage of a text file

Tags: c#, .net, text, encoding, globalization

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When we read them, these files sometimes contain garbage, because they were created in a different or unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks parameter of the StreamReader constructor works for UTF-8 and other Unicode files that start with a byte order mark, but I'm looking for a way to detect code pages such as ibm850 and windows1252.
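For reference, a minimal sketch of the byte order mark detection mentioned above. The class and method names (`BomSniffer`, `DetectFromBom`) are made up for illustration; only `StreamReader` and its `CurrentEncoding` property come from the framework. Without a byte order mark, the reader simply keeps the fallback encoding it was given, which is why this approach does not help with ibm850 or windows1252.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding StreamReader settles on.
    // If the stream has no byte order mark, this is just the fallback.
    public static Encoding DetectFromBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // CurrentEncoding is only updated once the reader has filled its buffer.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

A caller could pass `File.OpenRead(path)` as the stream and whatever default the application assumes as the fallback.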

Thanks for your answers. This is what I've done.

The files we receive come from end-users who do not have a clue about codepages. The receivers are also end-users, and by now this is what they know about codepages: codepages exist, and they are annoying.

Solution:

Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, you can use your human intelligence to guess what it should say.
I've created a small app with which the user can open the file and enter a piece of text that they know will appear in the file when the correct codepage is used.
Loop through all codepages and display the ones that produce the user-provided text.
If more than one codepage pops up, ask the user to specify more text.
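The loop described above could be sketched roughly as follows. This is not the actual app: `CodepageGuesser`, `FindMatchingEncodings`, and `AllEncodings` are hypothetical names, and the sketch assumes the file is small enough to decode in memory once per candidate. On .NET Core and .NET 5+, the legacy code pages are, as far as I know, only available after registering `CodePagesEncodingProvider`, so that line is left as a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CodepageGuesser
{
    // Decodes the bytes with every candidate and keeps the candidates
    // whose decoded text contains the text the user expects to see.
    public static List<Encoding> FindMatchingEncodings(
        byte[] fileBytes, string expectedText, IEnumerable<Encoding> candidates)
    {
        var matches = new List<Encoding>();
        foreach (Encoding candidate in candidates)
        {
            string decoded = candidate.GetString(fileBytes);
            if (decoded.IndexOf(expectedText, StringComparison.Ordinal) >= 0)
            {
                matches.Add(candidate);
            }
        }
        return matches;
    }

    // Every encoding the runtime reports.
    public static IEnumerable<Encoding> AllEncodings()
    {
        // Assumption: on .NET Core / .NET 5+, uncomment the next line first
        // so that code pages such as ibm850 and windows-1252 are included.
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncodings().Select(info => info.GetEncoding());
    }
}
```

Typical use would be `CodepageGuesser.FindMatchingEncodings(File.ReadAllBytes(path), "François", CodepageGuesser.AllEncodings())`. Many single-byte code pages share the same byte for common accented letters, so several matches are normal, which is why the last step asks the user for more text.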

372

asked Sep 18 '08 08:09

GvS

1 Answer

You can't detect the codepage; you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
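To illustrate what "analyse the bytes and guess" can look like, here is one common heuristic, sketched under the assumption that the only question is "is this plausibly UTF-8?". The names `Utf8Guess` and `LooksLikeUtf8` are made up for this example. It relies on a strict `UTF8Encoding` that throws on invalid byte sequences.

```csharp
using System;
using System.Text;

public static class Utf8Guess
{
    // Returns true when the bytes form a valid UTF-8 sequence.
    // This is a guess, not proof: pure ASCII also passes, and so can
    // text from another encoding that happens to be valid UTF-8.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The same trick does not carry over to single-byte code pages such as ibm850 or windows1252: they generally accept almost any byte sequence, so decoding succeeds either way and only the meaning of the text can tell them apart.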

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically, Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

187

answered Oct 07 '22 23:10

JV.
