 

encoding and decoding strings in dm-script and Python

Tags:

dm-script

My files sometimes contain multibyte characters, either in the file path or in the text inside the file, and I cannot read the correct strings from them. I know the best approach is to avoid multibyte characters altogether...
Unfortunately, the system encoding of my PC is Shift-JIS, not UTF-8; this is the default setting for Windows in my country. The default encoding of Python, on the other hand, is UTF-8.
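
To make the mismatch concrete, here is a minimal Python sketch showing that the same characters become different bytes under the two encodings, and that decoding with the wrong codec garbles them. The sample string is a made-up illustration, not the content of the files discussed here.

```python
# Illustrative sample text (an assumption, not taken from the actual files).
text = "日本語"

# The same characters encoded two ways.
sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

# The byte sequences differ in both content and length.
print(len(sjis_bytes), sjis_bytes.hex())
print(len(utf8_bytes), utf8_bytes.hex())

# Decoding with the matching codec recovers the original text.
roundtrip_ok = (
    sjis_bytes.decode("shift_jis") == text
    and utf8_bytes.decode("utf-8") == text
)

# Decoding UTF-8 bytes as Shift-JIS does not: the result is garbled text.
# errors="replace" substitutes a marker for byte sequences that cannot be decoded.
mojibake = utf8_bytes.decode("shift_jis", errors="replace")
print(roundtrip_ok, mojibake != text)
```

A reader that assumes the system encoding will therefore only show correct text for files that were saved in that encoding.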

These are examples of reading multibyte characters from dm-script and Python. "shift-jis.txt" and "utf-8.txt" contain the same strings but are saved in different encodings.

// Read every line of a text file into a tag list, using the given encoding value.
TagGroup txtTg(string filepath,number encoding)
{
    string text
    TagGroup TextInFileTg=NewTagList()
    number fileID=OpenFileForReading(filepath)
    while(ReadFileLine(fileID,encoding,text))
    {
        TextInFileTg.TagGroupInsertTagAsString(infinity(),text)
    }
    CloseFile(fileID)
    return TextInFileTg
}

string dirpath="C:\\Users\\arksa\\Documents\\scripts"

TagGroup flistTg=GetFilesInDirectory(dirpath,1)
string ext="txt"
string fname,fpath,text
number fileID
TagGroup fTg=NewTagGroup()
number encoding
for(number i=0;i<flistTg.TagGroupCountTags();i++)
{
    flistTg.TagGroupGetTagAsString("["+i+"]:Name",fname)
    fpath=PathConcatenate(dirpath,fname)
    if(fname.PathExtractExtension(0)==ext)
    {
        if(fname=="shift-jis.txt")
        {
            encoding=0
            fTg.TagGroupSetTagAsTagGroup("DM can read "+fname,txtTg(fpath,encoding))
        }
        if(fname=="utf-8.txt")
        {
            for(number j=0;j<3;j++)
            {
                encoding=j
                fTg.TagGroupSetTagAsTagGroup("DM cannot read "+fname+":"+"encoding="+encoding,txtTg(fpath,encoding))
            }
        }
    }
}

fTg.TagGroupOpenBrowserWindow("text",0)


import DigitalMicrograph as DM
import os

dirpath=r'C:\Users\arksa\Documents\scripts'
files=['shift-jis.txt','utf-8.txt']
encodings=['shift-jis','utf-8']

tag=DM.NewTagGroup()
for i in range(len(files)):
    path=os.path.join(dirpath,files[i])
    ftag=DM.NewTagList()
    with open(path,encoding=encodings[i]) as f:
        for fline in f:
            ftag.InsertTagAsString(-1,str(fline).encode('shift-jis').decode('utf-8','backslashreplace'))
            #print(fline)
    tag.SetTagAsTagGroup("writing strings in "+files[i]+" to tag group from python fails.",ftag)

tag.OpenBrowserWindow(True)


dm-script can read the Shift-JIS text file correctly, but not the UTF-8 file. Python can read both the Shift-JIS and the UTF-8 file, but writing the strings to a TagGroup fails. In other words, strings are handled correctly only when the multibyte characters use the same encoding as Windows, and then only by dm-script. I cannot find a function that writes a string to a TagGroup with an arbitrary encoding.
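
Since dm-script reads the Shift-JIS file correctly with encoding 0, one possible workaround is to re-encode UTF-8 files into the system code page before handing them to dm-script. The sketch below assumes the system code page is Windows code page 932 (Python's `cp932` codec). The function name `transcode_file` is hypothetical, and characters that have no representation in the target encoding raise an error rather than being silently dropped.

```python
from pathlib import Path


def transcode_file(src, dst, src_encoding="utf-8-sig", dst_encoding="cp932"):
    """Re-encode a text file from src_encoding to dst_encoding.

    "utf-8-sig" reads UTF-8 with or without a byte order mark.
    Working on bytes keeps the original line endings unchanged.
    Raises UnicodeEncodeError if a character cannot be represented
    in dst_encoding.
    """
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(dst_encoding))
    return dst
```

Calling `transcode_file(src, dst)` with the path of "utf-8.txt" as `src` would produce a copy that `txtTg` should be able to read with `encoding=0`, although this has not been verified inside DM.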

arksakura asked Dec 30 '25 09:12


1 Answer

Not an answer, but I found something potentially useful:

List of "encoding" parameter values as used in script commands in GMS 3.x:

// ENCODE parameter values

SYSTEM_MULTIBYTE = 0x00000000
GATAN            = 0x00000001
UNICODE          = 0x00000002
ROMAN            = 0x01000000
JAPANESE         = 0x01000001
CHINESE_TRAD     = 0x01000002
KOREAN           = 0x01000003
ARABIC           = 0x01000004
HEBREW           = 0x01000005
GREEK            = 0x01000006
CYRILLIC         = 0x01000007
DEVANAGARI       = 0x01000009
GURMUKHI         = 0x0100000A
GUJARATI         = 0x0100000B
THAI             = 0x01000015
CHINESE_SIMP     = 0x01000019
EASTEUROPE       = 0x0100001D
TURKISH          = 0x01000023
BALTIC           = 0x01000100
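
For scripts driven from Python, the values listed above can be collected in an enumeration so that they can be referred to by name. This is only a convenience sketch: the class name `DMEncoding` and the helper `encoding_name` are hypothetical, the names and numbers are copied from the list, and whether every script command accepts every value is not confirmed here.

```python
from enum import IntEnum


class DMEncoding(IntEnum):
    """Encoding parameter values copied from the list above (GMS 3.x)."""
    SYSTEM_MULTIBYTE = 0x00000000
    GATAN = 0x00000001
    UNICODE = 0x00000002
    ROMAN = 0x01000000
    JAPANESE = 0x01000001
    CHINESE_TRAD = 0x01000002
    KOREAN = 0x01000003
    ARABIC = 0x01000004
    HEBREW = 0x01000005
    GREEK = 0x01000006
    CYRILLIC = 0x01000007
    DEVANAGARI = 0x01000009
    GURMUKHI = 0x0100000A
    GUJARATI = 0x0100000B
    THAI = 0x01000015
    CHINESE_SIMP = 0x01000019
    EASTEUROPE = 0x0100001D
    TURKISH = 0x01000023
    BALTIC = 0x01000100


def encoding_name(value):
    """Return the symbolic name for a numeric value, or None if it is not listed."""
    try:
        return DMEncoding(value).name
    except ValueError:
        return None
```

The loop in the question only tried the values 0, 1 and 2, so the language-specific entries such as `DMEncoding.JAPANESE` remain untested there.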

BmyGuest answered Jan 02 '26 02:01