My goal: given an arbitrary UTF-16 position in a String, find the corresponding String.Index that represents the Character (i.e. the extended grapheme cluster) the specified UTF-16 code unit is a part of.
Example:
(I put the code in a Gist for easy copying and pasting.)
This is my test string:
let str = "👨🏾‍🚒"
(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)
It's a single Character (grapheme cluster) that consists of four Unicode scalars, or seven UTF-16 code units:
print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"]
print(str.utf16.map { "0x\(String($0, radix: 16))" })
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"]
print(str.utf16.count)
// → 7
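To make the relationship between the scalars and the UTF-16 offsets explicit, here is a minimal sketch that records the UTF-16 offset at which each scalar starts. It relies on the standard library's UTF16.width(_:); the variable names are illustrative only, and the string is written with escapes so that it survives copying and pasting.

```swift
// The same four scalars as the test string, written as escapes.
let emoji = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"

var offset = 0
var scalarStartOffsets: [Int] = []
for scalar in emoji.unicodeScalars {
    scalarStartOffsets.append(offset)
    // 2 for scalars that need a surrogate pair, 1 otherwise.
    offset += UTF16.width(scalar)
}

print(scalarStartOffsets)
// → [0, 2, 4, 5]
print(offset)
// → 7
```

So offsets 0, 2, 4 and 5 fall on scalar boundaries, while offsets 1, 3 and 6 point into the middle of a surrogate pair. Only offset 0 is the start of the Character.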
Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:
let utf16Offset = 2
let utf16Index = String.Index(encodedOffset: utf16Offset)
I can subscript the string with this index, but if the index doesn't fall on a Character boundary, the Character returned by the subscript might not cover the entire grapheme cluster:
let char = str[utf16Index]
print(char)
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f3fe", "0x200d", "0x1f692"]
Or the subscript operation might even trap (I'm not sure this is intended behavior):
let trappingIndex = String.Index(encodedOffset: 1)
str[trappingIndex]
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster
You can test if an index falls on a Character boundary:
extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}
trappingIndex.isOnCharacterBoundary(in: str)
// → false (as expected)
utf16Index.isOnCharacterBoundary(in: str)
// → true (WTF!)
The Issue:
I think the problem is that this last expression returns true. The documentation for String.Index.init(_:within:) says:
If the index passed as sourcePosition represents the start of an extended grapheme cluster (the element type of a string), then the initializer succeeds.
Here, utf16Index doesn't represent the start of an extended grapheme cluster: the grapheme cluster starts at offset 0, not offset 2. Yet the initializer succeeds.
As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.
Am I overlooking something? Is there another way to test if an index falls on the start of a Character? Is this a bug in Swift?
My environment: Swift 4.0/Xcode 9.0 on macOS 10.13.
Update: Check out the interesting Twitter thread about this question.
Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.
A possible solution, using the rangeOfComposedCharacterSequence(at:) method (available on String once Foundation is imported):
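import Foundation // provides rangeOfComposedCharacterSequence(at:)

extension String {
    // Returns the start of the Character that contains the given UTF-16
    // offset, or nil if the offset is out of range.
    func index(utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(encodedOffset: utf16Offset)
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}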
Example:
let str = "a👨🏾‍🚒b🇩🇪cπd👩‍👩‍👧‍👧e"
for utf16Offset in 0..<str.utf16.count {
    if let idx = str.index(utf16Offset: utf16Offset) {
        print(utf16Offset, str[idx])
    }
}
Output:
0 a
1 👨🏾‍🚒
2 👨🏾‍🚒
3 👨🏾‍🚒
4 👨🏾‍🚒
5 👨🏾‍🚒
6 👨🏾‍🚒
7 👨🏾‍🚒
8 b
9 🇩🇪
10 🇩🇪
11 🇩🇪
12 🇩🇪
13 c
14 π
15 π
16 d
17 👩‍👩‍👧‍👧
18 👩‍👩‍👧‍👧
19 👩‍👩‍👧‍👧
20 👩‍👩‍👧‍👧
21 👩‍👩‍👧‍👧
22 👩‍👩‍👧‍👧
23 👩‍👩‍👧‍👧
24 👩‍👩‍👧‍👧
25 👩‍👩‍👧‍👧
26 👩‍👩‍👧‍👧
27 👩‍👩‍👧‍👧
28 e
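If you would rather not depend on Foundation or on how a particular Swift version rounds misaligned indices, the same lookup can be sketched with the standard library alone: walk the string's Character indices and add up how many UTF-16 code units each Character occupies. The helper name characterIndex(containingUTF16Offset:) is hypothetical, not an existing API, and this is only a sketch, not a tuned implementation.

```swift
extension String {
    /// Hypothetical helper (not part of the standard library): returns the
    /// index of the Character whose UTF-16 code units include `offset`,
    /// or nil if `offset` is out of range.
    func characterIndex(containingUTF16Offset offset: Int) -> String.Index? {
        guard offset >= 0 else { return nil }
        var consumed = 0 // UTF-16 code units covered by the Characters seen so far
        for i in indices {
            consumed += String(self[i]).utf16.count
            if offset < consumed { return i }
        }
        return nil
    }
}

// "a", the firefighter emoji from the question (as escapes), then "b".
let sample = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"
for offset in 0..<sample.utf16.count {
    if let i = sample.characterIndex(containingUTF16Offset: offset) {
        // Print the offset and the number of scalars in the Character found.
        print(offset, sample[i].unicodeScalars.count)
    }
}
```

Offsets 1 through 7 all resolve to the same index here, because every returned index comes from `indices` and is therefore a Character boundary by construction. The trade-off is that each lookup is linear in the length of the string.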