
From any UTF-16 offset, find the corresponding String.Index that lies on a Character boundary


My goal: given an arbitrary UTF-16 position in a String, find the corresponding String.Index that represents the Character (i.e. the extended grapheme cluster) the specified UTF-16 code unit is a part of.

Example:

(I put the code in a Gist for easy copying and pasting.)

This is my test string:

let str = "👨🏾‍🚒"

(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)

It's a single Character (grapheme cluster) that consists of four Unicode scalars, or seven UTF-16 code units:

print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"]
print(str.utf16.map { "0x\(String($0, radix: 16))" })
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"]
print(str.utf16.count)
// → 7
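
To make the mapping between scalars and code units explicit, here is a small sketch (not part of the original Gist) that records the UTF-16 offset at which each scalar starts. The names scalarStarts and offset are mine; UTF16.width(_:) is the standard library call that reports how many code units a scalar needs.

```swift
// The test string, spelled with scalar escapes so it survives copy and paste.
let str = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"

// Record the UTF-16 offset at which each Unicode scalar starts.
var scalarStarts: [Int] = []
var offset = 0
for scalar in str.unicodeScalars {
    scalarStarts.append(offset)
    offset += UTF16.width(scalar)   // 2 for scalars above U+FFFF, otherwise 1
}

print(scalarStarts) // → [0, 2, 4, 5]
print(offset)       // → 7
```

So offsets 0, 2, 4 and 5 are scalar boundaries, while offsets 1, 3 and 6 fall between the two halves of a surrogate pair. Only offset 0 is a Character boundary.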

Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:

let utf16Offset = 2
let utf16Index = String.Index(encodedOffset: utf16Offset)
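
A note for newer toolchains: String.Index(encodedOffset:) has been deprecated since Swift 5 (SE-0241). A minimal sketch of the same step, assuming Swift 5.0 or later, where the offset is interpreted relative to a specific string:

```swift
let str = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"
let utf16Offset = 2

// Swift 5+: build the index from a UTF-16 offset into this particular string.
let utf16Index = String.Index(utf16Offset: utf16Offset, in: str)

// Round-trip back to the offset to confirm the index points where we expect.
print(utf16Index.utf16Offset(in: str)) // → 2
```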

I can subscript the string with this index, but if the index doesn't fall on a Character boundary, the Character returned by the subscript might not cover the entire grapheme cluster:

let char = str[utf16Index]
print(char)
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f3fe", "0x200d", "0x1f692"]

Or the subscript operation might even trap (I'm not sure this is intended behavior):

let trappingIndex = String.Index(encodedOffset: 1)
str[trappingIndex]
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster

You can test if an index falls on a Character boundary:

extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}

trappingIndex.isOnCharacterBoundary(in: str)
// → false (as expected)
utf16Index.isOnCharacterBoundary(in: str)
// → true (WTF!)

The Issue:

I think the problem is that this last expression returns true. The documentation for String.Index.init(_:within:) says:

If the index passed as sourcePosition represents the start of an extended grapheme cluster (the element type of a string), then the initializer succeeds.

Here, utf16Index doesn't represent the start of an extended grapheme cluster: the grapheme cluster starts at offset 0, not offset 2. Yet the initializer succeeds.

As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.

Am I overlooking something? Is there another way to test if an index falls on the start of a Character? Is this a bug in Swift?

My environment: Swift 4.0/Xcode 9.0 on macOS 10.13.

Update: Check out the interesting Twitter thread about this question.

Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.

Ole Begemann asked Sep 25 '17 15:09


1 Answer

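A possible solution, using the rangeOfComposedCharacterSequence(at:) method (which requires Foundation):

import Foundation

extension String {
    /// Returns the index of the Character that contains the given UTF-16
    /// offset, or nil if the offset is out of bounds.
    func index(utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(encodedOffset: utf16Offset)
        // The composed character sequence around idx is the grapheme cluster.
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}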
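
For Swift 5 and later, where String.Index(encodedOffset:) is deprecated, the same idea can be sketched with String.Index(utf16Offset:in:). The method names characterIndex(forUTF16Offset:) and isCharacterBoundary(utf16Offset:) are hypothetical, chosen only for this illustration, and I am assuming the Foundation method behaves the same way with the newer index initializer.

```swift
import Foundation  // rangeOfComposedCharacterSequence(at:) comes from Foundation

extension String {
    /// Hypothetical Swift 5 spelling of the helper: returns the index of the
    /// Character that contains the given UTF-16 offset, or nil if the offset
    /// is out of bounds.
    func characterIndex(forUTF16Offset utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(utf16Offset: utf16Offset, in: self)
        return rangeOfComposedCharacterSequence(at: idx).lowerBound
    }

    /// Hypothetical boundary test: true only if the offset is exactly where
    /// its enclosing Character starts.
    func isCharacterBoundary(utf16Offset: Int) -> Bool {
        guard let start = characterIndex(forUTF16Offset: utf16Offset) else {
            return false
        }
        return start.utf16Offset(in: self) == utf16Offset
    }
}

// "a", the firefighter emoji (7 code units), then "b".
let sample = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"

// For every UTF-16 offset, the offset at which its Character starts.
let starts = (0..<sample.utf16.count).map { offset in
    sample.characterIndex(forUTF16Offset: offset)!.utf16Offset(in: sample)
}
print(starts)
```

On Apple platforms I would expect starts to be [0, 1, 1, 1, 1, 1, 1, 1, 8], and isCharacterBoundary(utf16Offset:) to be true only for offsets 0, 1 and 8.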

Example:

let str = "a👨🏾‍🚒b🇩🇪c😀d👩‍👩‍👧‍👧e"
for utf16Offset in 0..<str.utf16.count {
    if let idx = str.index(utf16Offset: utf16Offset) {
        print(utf16Offset, str[idx])
    }
}

Output:

0 a
1 👨🏾‍🚒
2 👨🏾‍🚒
3 👨🏾‍🚒
4 👨🏾‍🚒
5 👨🏾‍🚒
6 👨🏾‍🚒
7 👨🏾‍🚒
8 b
9 🇩🇪
10 🇩🇪
11 🇩🇪
12 🇩🇪
13 c
14 😀
15 😀
16 d
17 👩‍👩‍👧‍👧
18 👩‍👩‍👧‍👧
19 👩‍👩‍👧‍👧
20 👩‍👩‍👧‍👧
21 👩‍👩‍👧‍👧
22 👩‍👩‍👧‍👧
23 👩‍👩‍👧‍👧
24 👩‍👩‍👧‍👧
25 👩‍👩‍👧‍👧
26 👩‍👩‍👧‍👧
27 👩‍👩‍👧‍👧
28 e
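
If you would rather not depend on Foundation, a different technique is to walk the string's Character indices and count UTF-16 code units as you go. This is only a sketch: characterIndex(containingUTF16Offset:) is a hypothetical name, and the walk is linear in the length of the string for each lookup.

```swift
extension String {
    /// Hypothetical Foundation-free variant: returns the index of the
    /// Character whose UTF-16 code units include the given offset, or nil if
    /// the offset is negative or past the end of the string.
    func characterIndex(containingUTF16Offset target: Int) -> String.Index? {
        guard target >= 0 else { return nil }
        var consumed = 0                    // UTF-16 code units seen so far
        for i in indices {                  // indices yields Character boundaries only
            consumed += String(self[i]).utf16.count
            if target < consumed { return i }
        }
        return nil                          // target is past the end
    }
}

// "a", the firefighter emoji (7 code units), then "b".
let text = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"
if let i = text.characterIndex(containingUTF16Offset: 4) {
    // Offset 4 lies inside the emoji, so i is the start of that Character.
    print(text[i].unicodeScalars.count) // → 4
}
```

Because the returned index always comes from indices, it is a valid Character boundary by construction.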
Martin R answered Oct 11 '22 15:10