
Transition to Unicode for an application that handles text files

My Win32 Delphi app analyzes text files produced by other applications that do not support Unicode. Thus, my app needs to read and write ANSI strings, but I would like to provide a better-localized user experience through the use of Unicode in the GUI. The app does some pretty heavy character-by-character analysis of strings in objects descended from TList.

In making the transition from Delphi 2006 to Delphi 2009 and to a Unicode GUI, should I plan to:

  1. go fully Unicode within my app, with the exception of AnsiString file I/O, or
  2. encapsulate the code that handles the AnsiStrings (i.e. continue to handle them as AnsiStrings internally) from an otherwise Unicode application?

I realize that a truly detailed response would require a substantial amount of my code - I'm just asking for impressions from those who've made this transition and who still have to work with plain text files. Where should I place the barrier between AnsiStrings and Unicode?

EDIT: if #1, any suggestions for mapping Unicode strings to AnsiString output? I would guess that the conversion of input strings will be automatic when using TStringList.LoadFromFile (for example).

Argalatyr asked Feb 03 '23


1 Answer

There is no such thing as AnsiString output - every text file has a character encoding. The moment your files contain characters outside the ASCII range you have to think about encoding, as loading the same file on systems with different locales will produce different results - unless you happen to be using a Unicode encoding.

If you load a text file you need to know which encoding it has. For formats like xml or html that information is part of the text, for Unicode there is the BOM, even though it isn't strictly necessary for UTF-8 encoded files.
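For illustration, here is a minimal sketch of BOM detection in Delphi 2009 (DetectFileEncoding is just an illustrative name; TEncoding.GetBufferEncoding falls back to the system ANSI code page when no BOM is found):

    uses
      Classes, SysUtils;

    { Returns the encoding detected from the file's BOM, falling back to
      the system ANSI code page (TEncoding.Default) when there is none. }
    function DetectFileEncoding(const FileName: string): TEncoding;
    var
      Stream: TFileStream;
      Buffer: TBytes;
    begin
      Result := nil;
      Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
      try
        // A BOM is at most 3 bytes for UTF-8, 2 for UTF-16; read 4 to be safe,
        // then trim the buffer to the number of bytes actually read.
        SetLength(Buffer, 4);
        SetLength(Buffer, Stream.Read(Buffer[0], Length(Buffer)));
        // GetBufferEncoding matches the buffer against the known BOMs,
        // assigns the detected encoding, and returns the BOM length.
        TEncoding.GetBufferEncoding(Buffer, Result);
      finally
        Stream.Free;
      end;
    end;
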

Converting an application to Delphi 2009 is a chance to think about the encoding of text files and correct past mistakes. An application's data files often outlive the application itself, so it pays to think about how to make them future-proof and universal. I would suggest going with UTF-8 as the text file encoding for all new applications; that way, porting an application to different platforms is easy. UTF-8 is the best encoding for data exchange, and for characters in the ASCII or ISO 8859-1 range it also creates much smaller files than UTF-16 or UTF-32.

If your data files contain only ASCII chars you are all set, as they are valid UTF-8 encoded files as well. If your data files are in ISO 8859-1 encoding (or any other fixed encoding), then use the matching conversion while loading them into string lists and saving them back - see the sketch below. If you don't know in advance what encoding they will have, ask the user upon loading, or provide an application setting for the default encoding.
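Something like this works with the TEncoding overloads in Delphi 2009 (ConvertAnsiFileToUtf8 and the file names are made up for illustration):

    uses
      Classes, SysUtils;

    { Loads a file assumed to be in the system ANSI code page and writes
      it back out as UTF-8; SaveToFile writes the UTF-8 BOM as preamble. }
    procedure ConvertAnsiFileToUtf8(const InFile, OutFile: string);
    var
      Lines: TStringList;
    begin
      Lines := TStringList.Create;
      try
        // TEncoding.Default is the system ANSI code page; substitute
        // whatever encoding actually matches your input files.
        Lines.LoadFromFile(InFile, TEncoding.Default);
        Lines.SaveToFile(OutFile, TEncoding.UTF8);
      finally
        Lines.Free;
      end;
    end;

For a specific source encoding such as ISO 8859-1, create it with TEncoding.GetEncoding(28591) - and free it after use, since encodings returned by GetEncoding are owned by the caller, unlike the TEncoding.Default and TEncoding.UTF8 singletons.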

Use Unicode strings internally. Depending on the amount of data you need to handle, you might use UTF-8 encoded strings instead.
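To illustrate the conversions (a minimal sketch with arbitrary names: in Delphi 2009, string is UnicodeString, and UTF8String is an AnsiString type with the UTF-8 code page, so casts convert between them):

    procedure StringConversions;
    var
      U: string;       // UnicodeString (UTF-16) in Delphi 2009
      U8: UTF8String;  // AnsiString with code page 65001 (UTF-8)
      A: AnsiString;   // system ANSI code page
    begin
      U := 'Héllo wörld';
      U8 := UTF8String(U);  // UTF-16 to UTF-8, lossless
      A := AnsiString(U);   // to the ANSI code page, may lose characters
      U := string(U8);      // back to UTF-16
    end;
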

mghie answered Apr 07 '23