From any UTF-16 offset, find the corresponding String.Index that lies on a Character boundary

Question

My goal: given an arbitrary UTF-16 position in a String, find the corresponding String.Index that represents the Character (i.e. the extended grapheme cluster) the specified UTF-16 code unit is a part of.

Example:

(I put the code in a Gist for easy copying and pasting.)

This is my test string:

let str = "👨🏾‍🚒"

(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)

It's a single Character (grapheme cluster) that consists of four Unicode scalars, or seven UTF-16 code units:

print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"]
print(str.utf16.map { "0x\(String($0, radix: 16))" })
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"]
print(str.utf16.count)
// → 7
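The breakdown above can be made more explicit by listing the UTF-16 offset at which each Unicode scalar starts. The following is only a small sketch: `scalarLayout(of:)` is a hypothetical helper name, not a standard library API, and the string is spelled with Unicode escapes so that it survives copying and pasting.

```swift
/// Hypothetical helper (not part of the standard library): for each Unicode
/// scalar in `str`, returns the UTF-16 offset where it starts and the number
/// of UTF-16 code units it occupies.
func scalarLayout(of str: String) -> [(offset: Int, scalar: Unicode.Scalar, width: Int)] {
    var layout: [(offset: Int, scalar: Unicode.Scalar, width: Int)] = []
    var offset = 0
    for scalar in str.unicodeScalars {
        // Scalars above U+FFFF need a surrogate pair, i.e. two code units.
        let width = String(scalar).utf16.count
        layout.append((offset: offset, scalar: scalar, width: width))
        offset += width
    }
    return layout
}

// The same emoji as above: man, skin tone modifier, zero-width joiner, fire engine.
let emoji = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"
for entry in scalarLayout(of: emoji) {
    print(entry.offset, "0x" + String(entry.scalar.value, radix: 16), entry.width)
}
// 0 0x1f468 2
// 2 0x1f3fe 2
// 4 0x200d 1
// 5 0x1f692 2
```

So the scalars start at UTF-16 offsets 0, 2, 4 and 5, while offsets 1, 3 and 6 fall in the middle of a surrogate pair.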

Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:

let utf16Offset = 2
let utf16Index = String.Index(encodedOffset: utf16Offset)

I can subscript the string with this index, but if the index doesn't fall on a Character boundary, the Character returned by the subscript might not cover the entire grapheme cluster:

let char = str[utf16Index]
print(char)
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f3fe", "0x200d", "0x1f692"]

Or the subscript operation might even trap (I'm not sure this is intended behavior):

let trappingIndex = String.Index(encodedOffset: 1)
str[trappingIndex]
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster

You can test if an index falls on a Character boundary:

extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}

trappingIndex.isOnCharacterBoundary(in: str)
// → false (as expected)
utf16Index.isOnCharacterBoundary(in: str)
// → true (WTF!)

The Issue:

I think the problem is that this last expression returns true. The documentation for String.Index.init(_:within:) says:

If the index passed as sourcePosition represents the start of an extended grapheme cluster—the element type of a string—then the initializer succeeds.

Here, utf16Index doesn't represent the start of an extended grapheme cluster — the grapheme cluster starts at offset 0, not offset 2. Yet the initializer succeeds.
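One way to cross-check which UTF-16 offsets really start a Character, without relying on String.Index.init(_:within:), is to walk the string's Characters and add up their UTF-16 lengths. This is a minimal sketch that works on plain Int offsets; `characterStartOffsets(of:)` and `isCharacterStart(utf16Offset:in:)` are hypothetical helper names, not standard library APIs, and each call is linear in the length of the string.

```swift
/// Hypothetical helper (not a standard library API): the UTF-16 offsets at
/// which each Character of `str` starts.
func characterStartOffsets(of str: String) -> [Int] {
    var offsets: [Int] = []
    var offset = 0
    for character in str {
        offsets.append(offset)
        // The UTF-16 length of a Character is the code unit count of the
        // one-character string built from it.
        offset += String(character).utf16.count
    }
    return offsets
}

/// True only if `utf16Offset` is the first code unit of a Character.
func isCharacterStart(utf16Offset: Int, in str: String) -> Bool {
    return characterStartOffsets(of: str).contains(utf16Offset)
}

// The test string from above, spelled with escapes.
let testString = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"
print(isCharacterStart(utf16Offset: 0, in: testString)) // true
print(isCharacterStart(utf16Offset: 2, in: testString)) // false
```

Unlike the initializer-based check, this reports offset 2 as not being a Character boundary, because the only Character in the string starts at offset 0.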

As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.

Am I overlooking something? Is there another way to test if an index falls on the start of a Character? Is this a bug in Swift?

My environment: Swift 4.0/Xcode 9.0 on macOS 10.13.

Update: Check out the interesting Twitter thread about this question.

Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.

Martin R · Accepted Answer

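A possible solution, using the rangeOfComposedCharacterSequence(at:) method (available on String once Foundation is imported):

import Foundation

extension String {
    /// Returns the start index of the Character that contains the given
    /// UTF-16 offset, or nil if the offset is out of bounds.
    func index(utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        // Swift 4 API; deprecated in Swift 5 in favor of String.Index(utf16Offset:in:).
        let idx = String.Index(encodedOffset: utf16Offset)
        // Expand to the enclosing composed character sequence and take its start.
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}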

Example:

let str = "a👨🏾‍🚒b🇩🇪c😀d👩‍👩‍👧‍👧e"
for utf16Offset in 0..<str.utf16.count {
    if let idx = str.index(utf16Offset: utf16Offset) {
        print(utf16Offset, str[idx])
    }
}

Output:

0 a
1 👨🏾‍🚒
2 👨🏾‍🚒
3 👨🏾‍🚒
4 👨🏾‍🚒
5 👨🏾‍🚒
6 👨🏾‍🚒
7 👨🏾‍🚒
8 b
9 🇩🇪
10 🇩🇪
11 🇩🇪
12 🇩🇪
13 c
14 😀
15 😀
16 d
17 👩‍👩‍👧‍👧
18 👩‍👩‍👧‍👧
19 👩‍👩‍👧‍👧
20 👩‍👩‍👧‍👧
21 👩‍👩‍👧‍👧
22 👩‍👩‍👧‍👧
23 👩‍👩‍👧‍👧
24 👩‍👩‍👧‍👧
25 👩‍👩‍👧‍👧
26 👩‍👩‍👧‍👧
27 👩‍👩‍👧‍👧
28 e
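
For readers on Swift 5 or later, where String.Index(encodedOffset:) is deprecated, the same lookup can be sketched with the standard library alone. This is not the method from the answer above but a swapped-in technique: it builds the index with String.Index(utf16Offset:in:) and then searches backwards for the last Character index that is not greater than it. The name `characterStart(atUTF16Offset:)` is hypothetical, and each call is linear in the length of the string.

```swift
extension String {
    /// Hypothetical helper (assumes Swift 5 or later): the start index of the
    /// Character containing the given UTF-16 offset, or nil if out of bounds.
    func characterStart(atUTF16Offset offset: Int) -> String.Index? {
        guard offset >= 0 && offset < utf16.count else { return nil }
        // An index into the UTF-16 view; it may fall inside a Character.
        let target = String.Index(utf16Offset: offset, in: self)
        // `indices` only contains Character boundaries, so the last one that
        // is <= target is the start of the enclosing Character.
        return indices.last(where: { $0 <= target })
    }
}

// "a", the firefighter emoji from the question, then "b" (9 UTF-16 code units).
let sample = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"
for offset in 0..<sample.utf16.count {
    if let idx = sample.characterStart(atUTF16Offset: offset) {
        print(offset, sample[idx])
    }
}
```

As in the example above, offset 0 maps to "a", offsets 1 through 7 all map to the emoji, and offset 8 maps to "b".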

Ole Begemann
