
What's the best way to identify Unicode-encoded text files in Windows?

I am working on a codebase that has some Unicode-encoded files scattered throughout, because team members develop with different editors (and different default settings). I would like to clean up the codebase by finding all the Unicode-encoded files and converting them back to ANSI encoding.

Any thoughts on how to accomplish the "finding" part of this task would be truly appreciated.

asked Jan 12 '11 18:01 by HOCA



1 Answer

See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”

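  • UTF-8 can be detected by validation: a file that contains non-ASCII bytes and still decodes as UTF-8 without errors is almost certainly UTF-8. You can also look for the BOM EF BB BF, but don't rely on it, since many UTF-8 files have none.
  • UTF-16 can be detected by looking for the BOM (FF FE for little-endian, FE FF for big-endian).
  • UTF-32 can be detected by validation, or by the BOM (FF FE 00 00 for little-endian, 00 00 FE FF for big-endian).
  • Otherwise assume the ANSI code page.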
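
The checks in the list above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name `guess_encoding` and the returned labels are made up for this example, and the UTF-32 validation step is left out for brevity. The UTF-32 BOMs are tested before the UTF-16 ones because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the same bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess label for the encoding of `data` (hypothetical helper)."""
    # 1. A BOM, when present, is the strongest hint.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. Pure ASCII is valid in UTF-8 and in every ANSI code page.
    if data.isascii():  # requires Python 3.7+
        return "ascii"
    # 3. Non-ASCII bytes that validate as UTF-8 are very likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Otherwise assume the ANSI code page.
        return "ansi"
```

To check a single file, read it in binary mode and pass the bytes in, for example `guess_encoding(open(path, "rb").read())`.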

Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.

Well that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII.
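
For a codebase that is supposed to be pure ASCII, the "finding" step reduces to flagging any file that starts with a BOM or contains a byte above 0x7F. A minimal sketch of such a scan follows; the function name `find_suspect_files` and the default extension list are assumptions for illustration, so adjust them to the file types in your tree.

```python
import codecs
import os

# startswith() accepts a tuple, so one call covers every BOM.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def find_suspect_files(root, extensions=(".c", ".h", ".cs", ".txt")):
    """Yield (path, reason) for files with a BOM or non-ASCII bytes (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip files whose extension is not in the (assumed) source list.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            if data.startswith(BOMS):
                yield path, "BOM"
            elif not data.isascii():  # requires Python 3.7+
                yield path, "non-ASCII bytes"
```

Calling `for path, reason in find_suspect_files("."): print(path, reason)` from the repository root would list the files that need converting.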

answered Sep 18 '22 05:09 by dan04