Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I reverse a UTF-8 string in place?

Recently, someone asked about an algorithm for reversing a string in place in C. Most of the proposed solutions had troubles when dealing with non single-byte strings. So, I was wondering what could be a good algorithm for dealing specifically with utf-8 strings.

I came up with some code, which I'm posting as an answer, but I'd be glad to see other people's ideas or suggestions. I preferred to use actual code, so I've chosen C#, as it seems to be one of the most popular language in this site, but I don't mind if your code is in another language, as long as it could be reasonably understood by anyone who is familiar with an imperative language. And, as this is intended to see how such an algorithm could be implemented at a low-level (by low-level I just mean dealing with bytes), the idea is to avoid using libraries for the core code.

Notes:

I'm interested in the algorithm itself, its performance and how could it be optimized (I mean algorithm-level optimization, not replacing i++ with ++i and such; I'm not really interested in actual benchmarks either).

I don't mean to actually use it in production code or "reinventing the wheel". This is just out of curiosity and as an exercise.

I'm using C# byte arrays so I'm assuming you can get the length of the string without running though the string until you find a NUL. That is, I'm not accounting for the complexity of finding the length of the string. But if you're using C, for instance, you could factor that out by using strlen() before calling the core code.

Edit:

As Mike F points out, my code (and other people's code posted here) is not dealing with composite characters. Some info about those here. I'm not familiar with the concept, but if that means that there are "combining characters", i.e., characters / code points that are only valid in combination with other "base" characters / code points, a look-up table of such characters could be used to preserve the order of the "global" character ("base" + "combining" characters) when reversing.

like image 864
Juan Pablo Califano Avatar asked Oct 13 '08 22:10

Juan Pablo Califano


People also ask

How do I encode a string in UTF-8?

In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. where charsetName is the specific charset by which the String is encoded into an array of bytes.

Is UTF-8 the default encoding?

Fortunately UTF-8 is the default per sé. When reading an XML document and writing it in another encoding, mostly this attribute will be patched too.

Can UTF-8 handle all characters?

UTF-8 extends the ASCII character set to use 8-bit code points, which allows for up to 256 different characters. This means that UTF-8 can represent all of the printable ASCII characters, as well as the non-printable characters.


1 Answers

I'd make one pass reversing the bytes, then a second pass that reverses the bytes in any multibyte characters (which are easily detected in UTF8) back to their correct order.

You can definitely handle this in line in a single pass, but I wouldn't bother unless the routine became a bottleneck.

like image 117
Michael Burr Avatar answered Oct 23 '22 05:10

Michael Burr