
How to flip text horizontally?

i need to write a function that will flip all the characters of a string left-to-right.

e.g.:

Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.

should become

.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT

i can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).

Naive solution

A naive solution might flip the string unit by unit, where a unit is a 16-bit word (i would have said byte-for-byte if we could assume that a byte was 16 bits). i could also say character-for-character, where character is the data type Char, which holds a single UTF-16 code unit:

String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
foreach (Char c in original)
{
   // prepend each Char, so the string is reversed one UTF-16 code unit at a time
   flipped = c + flipped;
}

Results in the incorrectly flipped text:

  • ɗỉf̴ḟếr̆ęnͥt
  • ̨tͥnę̆rếḟ̴fỉɗ

This is because one "character" takes multiple "code points".

  • ɗỉf̴ḟếr̆ęnͥt
  • ɗ f ˜ ế r ˘ ę n i t ˛

and flipping each "code point" gives:

  • ˛ t i n ę ˘ r ế ˜ f ɗ

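This is not the same text: every combining mark now follows, and so attaches to, the wrong base letter. Had the string contained a surrogate pair, the result would not even be valid UTF-16.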
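To see the effect on a small input, here is a minimal Python sketch (not the asker's C# or Delphi). The helper names `describe` and `naive_flip` are made up for illustration. One caveat: Python strings are indexed by code point rather than by UTF-16 code unit, so this only reproduces the combining-mark half of the problem.

```python
import unicodedata

def describe(text):
    """List each code point in text as (U+XXXX, Unicode name)."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]

def naive_flip(text):
    """Reverse code point by code point, like the naive loop above."""
    return text[::-1]

# "e" followed by COMBINING ACUTE ACCENT, then "x"
sample = "e\u0301x"
for entry in describe(sample):
    print(entry)
print("after naive flip:")
for entry in describe(naive_flip(sample)):
    print(entry)
```

After the naive flip the accent directly follows the "x", so a renderer draws it on the "x" instead of the "e".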

Failure

The problem happens in UTF-16 encoding when there is:

  • combining diacritics
  • characters outside the Basic Multilingual Plane (which UTF-16 encodes as surrogate pairs)

Those same issues happen in UTF-8 encoding, with the additional case

  • any character outside the 0..127 ASCII range
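Both encoding-level failures can be reproduced with a short Python sketch. The helpers `reverse_units` and `decodes` are hypothetical names. The idea is to encode the text, reverse it one storage unit at a time (2 bytes for UTF-16, 1 byte for UTF-8), and check whether the result still decodes.

```python
def reverse_units(text, encoding, unit_size):
    """Encode text, reverse it unit by unit (unit_size bytes each), return the bytes."""
    data = text.encode(encoding)
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    return b"".join(reversed(units))

def decodes(data, encoding):
    """True if data is well-formed in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

clef = "a\U0001D11E"  # U+1D11E lies outside the BMP, so UTF-16 needs a surrogate pair

print(decodes(reverse_units("abc", "utf-16-le", 2), "utf-16-le"))   # plain BMP text
print(decodes(reverse_units(clef, "utf-16-le", 2), "utf-16-le"))    # surrogate pair swapped
print(decodes(reverse_units("\u00e9", "utf-16-le", 2), "utf-16-le"))  # one unit in UTF-16
print(decodes(reverse_units("\u00e9", "utf-8", 1), "utf-8"))        # two bytes in UTF-8
```

In UTF-16 only the surrogate pair breaks, while in UTF-8 a single precomposed "é" is already enough, which is the "just less often" difference between the two encodings.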

i can limit myself to the simpler UTF-16 encoding, since that's the encoding that the language i'm using has (e.g. C#, Delphi).

The problem, it seems to me, is discovering whether a run of consecutive code points are combining characters that need to come along with the base glyph.
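One rough way to make that discovery, sketched in Python, is to look up each code point's general category in the Unicode character database and treat the mark categories (Mn, Mc, Me) as "combining". The function names `is_mark` and `mark_run_length` are hypothetical, and this is an approximation: it says nothing about other sequences that also belong together, such as Hangul jamo or emoji joined with ZWJ.

```python
import unicodedata

def is_mark(ch):
    """True if ch has a mark category (Mn, Mc or Me) in the Unicode database."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def mark_run_length(text, start):
    """Count the combining marks that directly follow the code point at index start."""
    count = 0
    i = start + 1
    while i < len(text) and is_mark(text[i]):
        count += 1
        i += 1
    return count

# "e" + COMBINING ACUTE ACCENT + COMBINING DOT BELOW, then a bare "x"
sample = "e\u0301\u0323x"
print(mark_run_length(sample, 0))  # marks that must travel with the "e"
print(mark_run_length(sample, 3))  # marks that must travel with the "x"
```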

It's also fun to watch an online text reverser site fail to take this into account.

Note:

  • any solution should assume that i don't have access to a UTF-32 encoding library (mainly because i don't have access to any UTF-32 encoding library)
  • access to a UTF-32 encoding library would solve the UTF-8/UTF-16 surrogate pair (non-BMP) problem, but not the combining diacritics problem
asked Oct 08 '22 22:10 by Ian Boyd



1 Answer

The term you're looking for is “grapheme cluster”, as defined in Unicode Standard Annex #29 (Unicode Text Segmentation), in the section on Grapheme Cluster Boundaries.

Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.
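The three steps can be sketched in Python, working on a plain list of 16-bit integers so that the surrogate handling is visible. This is a simplified illustration under stated assumptions, not a full implementation of the Grapheme_Cluster_Break rules: step 2 only attaches code points whose general category starts with "M" (combining marks) to the preceding base, and all function names are made up for the example.

```python
import unicodedata

def decode_utf16(units):
    """Step 1: pair surrogates so that each item is one code point (an int)."""
    code_points, i = [], 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)  # BMP code point (an unpaired surrogate is kept as-is)
            i += 1
    return code_points

def encode_utf16(code_points):
    """Turn code points back into UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units

def group_clusters(code_points):
    """Step 2 (simplified): attach combining marks to the preceding base code point."""
    clusters = []
    for cp in code_points:
        if clusters and unicodedata.category(chr(cp)).startswith("M"):
            clusters[-1].append(cp)
        else:
            clusters.append([cp])
    return clusters

def flip_utf16(units):
    """Step 3: reverse the order of the clusters, keeping each cluster intact."""
    clusters = group_clusters(decode_utf16(units))
    flipped = [cp for cluster in reversed(clusters) for cp in cluster]
    return encode_utf16(flipped)

# "e" + COMBINING ACUTE ACCENT, U+1D11E as a surrogate pair, then "x"
units = [0x0065, 0x0301, 0xD834, 0xDD1E, 0x0078]
print([hex(u) for u in flip_utf16(units)])
```

The accent stays behind its "e" and the surrogate pair keeps its high-then-low order. A library that implements the full boundary rules is still needed for cases this sketch ignores.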

You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example in ICU you might use a character BreakIterator (created with BreakIterator::createCharacterInstance in ICU4C), which is misleadingly named as it works on grapheme clusters, not ‘characters’ as Unicode knows them.

answered Oct 12 '22 01:10 by bobince
