How do people implement UTF-8 in Smalltalk?

Question

I've been doing some preliminary work on implementing UTF8String, for which I had to address problems with messages such as #size, #at:, #do:, etc. For some of these I could not find a good solution. Examples include #new: (class side) and #at:put: (instance side), because the number of bytes they would need (or use) depends on the actual characters the string will eventually contain.
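To make the size problem concrete, here is a minimal sketch. It is written in Python purely for illustration (the question itself is about Smalltalk), and it only shows that a UTF-8 character occupies between one and four bytes, so an #at:put:-style replacement can change the byte length of the whole string.

```python
# UTF-8 uses 1 to 4 bytes per code point, so the byte size of a string
# cannot be derived from its character count alone.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)

# Replacing one character (the analogue of #at:put:) changes the byte length
# even though the character count stays at 4.
before = "cafe".encode("utf-8")
after = "caf\u00e9".encode("utf-8")
print(len(before), len(after))
```

This is why #new: cannot know how many bytes to reserve from a character count alone.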

One idea is to allocate additional (unused) null bytes at the tail, which would not actually be part of the string, and to use #become: only in those cases where one runs out of null positions. Is this a good (or bad) idea? How should a proper implementation work?

aka.nice · Accepted Answer

One solution would be to hold the sequence of bytes in an instance variable (a ByteArray) and thus use a normal pointer-based subclass instead of a variableByteSubclass.

Then the strategy of pre-allocating extra bytes can easily be implemented, since you would store the effective size in another instance variable. It is up to you to tune the balance between code complexity and efficiency, and between memory and speed.
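The layout described above could look roughly like the following sketch. It is written in Python only as an illustration of the idea, not as Smalltalk code. The class name `Utf8Buffer`, its methods, and the doubling growth policy are all hypothetical choices made for this example. The inner `bytearray` plays the role of the ByteArray instance variable and `_used` the role of the effective size.

```python
class Utf8Buffer:
    """Hypothetical sketch: UTF-8 bytes in an inner buffer plus a used-byte count."""

    def __init__(self, capacity=16):
        self._bytes = bytearray(capacity)  # stands in for the ByteArray ivar
        self._used = 0                     # effective size in bytes

    @property
    def byte_size(self):
        return self._used

    def _ensure_room(self, extra):
        # Only the inner buffer grows; the outer object keeps its identity,
        # so nothing like #become: is needed. Doubling is an assumed policy.
        needed = self._used + extra
        if needed > len(self._bytes):
            new_capacity = max(needed, 2 * len(self._bytes))
            self._bytes.extend(bytes(new_capacity - len(self._bytes)))

    def _char_bounds(self, index):
        # Linear scan for the byte range of the index-th character (0-based).
        # Continuation bytes have the bit pattern 10xxxxxx.
        start, count = 0, 0
        while start < self._used:
            end = start + 1
            while end < self._used and (self._bytes[end] & 0xC0) == 0x80:
                end += 1
            if count == index:
                return start, end
            start, count = end, count + 1
        raise IndexError(index)

    def append(self, char):
        encoded = char.encode("utf-8")
        self._ensure_room(len(encoded))
        self._bytes[self._used:self._used + len(encoded)] = encoded
        self._used += len(encoded)

    def at_put(self, index, char):
        # Analogue of #at:put:. The new character may need more or fewer
        # bytes than the old one, so the tail is shifted accordingly.
        start, end = self._char_bounds(index)
        encoded = char.encode("utf-8")
        delta = len(encoded) - (end - start)
        if delta > 0:
            self._ensure_room(delta)
        tail = self._bytes[end:self._used]  # copy taken before overwriting
        self._bytes[start:start + len(encoded)] = encoded
        tail_start = start + len(encoded)
        self._bytes[tail_start:tail_start + len(tail)] = tail
        self._used += delta

    def __len__(self):
        # Character count: every byte that is not a continuation byte.
        return sum(1 for b in self._bytes[:self._used] if (b & 0xC0) != 0x80)

    def __str__(self):
        return bytes(self._bytes[:self._used]).decode("utf-8")


s = Utf8Buffer(capacity=4)
for ch in "cafe":
    s.append(ch)
s.at_put(3, "\u00e9")  # the 1-byte 'e' is replaced by a 2-byte character
print(str(s), len(s), s.byte_size)
```

Note the cost this sketch makes visible: both indexing and replacement are linear in the string length, which is part of the complexity/efficiency trade-off mentioned above.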

The advantage is that you avoid interference from VM primitives such as copyReplaceFrom:to:with:startingAt:, which can transfer raw bytes from one byte-oriented class to another and so potentially cause the encoding to be misinterpreted.
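The kind of misinterpretation meant here can be shown with a small sketch, again in Python as an illustrative analogue rather than Smalltalk. It assumes the receiving side treats each byte as one character, modeled here with Latin-1. That is roughly what would happen if raw UTF-8 bytes were copied into a one-byte-per-character string class.

```python
text = "caf\u00e9"                  # 4 characters, the last one is e-acute
utf8_bytes = text.encode("utf-8")   # 5 bytes: the last character takes 2

# Reading the same raw bytes as one-byte-per-character data (Latin-1 here)
# produces different characters and a different length.
misread = utf8_bytes.decode("latin-1")
print(len(text), len(misread))
print(misread == text)
```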

Another advantage is that you don't need to invoke the become: super-power.

Tags:

utf-8

smalltalk

Leandro Caniglia