
Convert from ANSI to UTF-8


I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can convert them one at a time in Notepad++, but that is not practical for 600,000 files. Can I do this in R or Python?

I found this link, but the Python script there does not run: "notepad++ converting ansi encoded file to utf-8"

Karan Pappala asked Jul 17 '15 08:07


2 Answers

Why don't you read the file and write it as UTF-8? You can do that in Python.

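# codecs provides encoding-aware file I/O
import codecs

# path is the file to convert; set it before running

# read the input file as ANSI
# (cp1252 is assumed here; use the code page your files were actually saved in)
with codecs.open(path, 'r', encoding='cp1252') as file:
    lines = file.read()

# write the same text back to the file as UTF-8
with codecs.open(path, 'w', encoding='utf8') as file:
    file.write(lines)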
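For the 600,000 files in the question, the same read-then-write idea can be applied across a whole directory tree. The following is only a sketch under stated assumptions: the files are taken to be cp1252 (the usual Western European "ANSI" code page) and to end in `.txt`, and the names `convert_tree`, `root`, `src_encoding` and `suffix` are hypothetical, not from the original answer. It requires Python 3.3 or later for `os.replace`.

```python
import os

def convert_tree(root, src_encoding="cp1252", suffix=".txt"):
    """Re-encode every matching file under root to UTF-8, in place.

    src_encoding and suffix are assumptions; adjust them to your data.
    Returns the number of files converted.
    """
    converted = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # Read raw bytes and decode them with the assumed source encoding.
            with open(path, "rb") as f:
                text = f.read().decode(src_encoding)
            # Write to a temporary file first, then swap it in, so an
            # interrupted run does not leave a half-written original.
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(text.encode("utf-8"))
            os.replace(tmp_path, path)
            converted += 1
    return converted
```

Calling `convert_tree("path/to/files")` would walk that folder and its subfolders. Working on bytes keeps the original line endings intact, and a file containing bytes that are undefined in the assumed code page will raise `UnicodeDecodeError` rather than being silently damaged. It is worth trying this on a copy of the data first.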
3Ducker answered Oct 03 '22 05:10


I appreciate that this is an old question, but having recently resolved a similar problem, I thought I would share my solution.

I had a file, prepared by another program, that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.

Python recognises the ANSI encoding as 'mbcs' (a codec that is only available on Windows), so the code I used, adapted from something else I found, is:

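import codecs

# copy the file in chunks of about 1 MiB so large files are never held in memory whole
blockSize = 1048576
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:
                # an empty read means the end of the source file
                break
            targetFile.write(contents)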

The link below, which I found during my research, contains some information on the encoding types:

https://docs.python.org/2.4/lib/standard-encodings.html
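One practical caveat with any ANSI-to-UTF-8 conversion is that decoding a file that is already UTF-8 as ANSI garbles its non-ASCII characters. This can happen if a batch run is interrupted and restarted. A possible guard, sketched here as an assumption rather than as part of the code above, is to try decoding as UTF-8 first and fall back to the ANSI code page only when that fails. The helper name `to_utf8_bytes` and the cp1252 fallback are hypothetical choices.

```python
def to_utf8_bytes(raw, fallback_encoding="cp1252"):
    """Return raw re-encoded as UTF-8, leaving valid UTF-8 input unchanged.

    fallback_encoding is an assumption about what "ANSI" means for your files.
    """
    try:
        # Pure ASCII and already-converted files decode cleanly and pass through.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat the bytes as the assumed ANSI code page.
        return raw.decode(fallback_encoding).encode("utf-8")
```

This is a heuristic: ANSI text with accented characters is rarely valid UTF-8 by accident, but it can happen, so spot-checking a few converted files is still sensible.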

Strebormit answered Oct 03 '22 05:10