For a random string generator, I thought it would be nice to use CharacterSet as the input type for the alphabet, since the predefined sets such as CharacterSet.lowercaseLetters are obviously useful (even if they may contain a more diverse range of characters than you'd expect).
However, apparently you can only query character sets for membership, but not enumerate them, let alone index them. All we get is bitmapRepresentation, an 8 KB chunk of data with an indicator bit for every (?) character. But even if you peel out individual bits by index i (which is less than nice, going through byte-oriented Data), Character(UnicodeScalar(i)) does not give the correct letter. Which means that the format is somewhat obscure -- and, of course, it's not documented.
Of course we can iterate over all characters (per plane) but that is a bad idea, cost-wise: a 20-character set may require iterating over tens of thousands of characters. Speaking in CS terms: bit-vectors are a (very) bad implementation for sparse sets. Why they chose to make the trade-off in this way here, I have no idea.
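For concreteness, the brute-force approach would look roughly like this. It is only a sketch: bruteForceScalars is a name made up here, and it simply calls contains for every Unicode scalar value, which is exactly the cost problem described above.

```swift
import Foundation

// Hypothetical helper: enumerate the members of a set by testing every
// Unicode scalar value from U+0000 to U+10FFFF for membership.
func bruteForceScalars(in set: CharacterSet) -> [Unicode.Scalar] {
    var members: [Unicode.Scalar] = []
    for value in UInt32(0) ... 0x10FFFF {
        // Unicode.Scalar(_: UInt32) is failable: surrogate code points yield nil.
        guard let scalar = Unicode.Scalar(value) else { continue }
        if set.contains(scalar) {
            members.append(scalar)
        }
    }
    return members
}

let vowels = CharacterSet(charactersIn: "aeiou")
let found = bruteForceScalars(in: vowels)
print(found.map { String($0) }.joined())  // prints "aeiou"
```

More than a million membership tests to recover five characters.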
Am I missing something here, or is CharacterSet just another dead end in the Foundation API?
Following the documentation, here is an improvement on Satachito's answer that supports sets spanning non-contiguous planes, by actually taking the plane index byte into account:
import Foundation

extension CharacterSet {
    /// Returns the code points of all members, in ascending order.
    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        for (i, w) in bitmapRepresentation.enumerated() {
            // After the first 8192 bytes (the BMP), each further plane is stored
            // as 1 plane index byte followed by 8192 bitmap bytes.
            let k = i % 8193
            if k == 8192 {
                // plane index byte
                plane = Int(w) << 13
                continue
            }
            // Each byte covers 8 consecutive code points, lowest bit first.
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & 1 << j != 0 {
                result.append(base + j)
            }
        }
        return result
    }

    func printHexValues() {
        codePoints().forEach { print(String(format: "%02X", $0)) }
    }
}
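print("whitespaces:")
CharacterSet.whitespaces.printHexValues()
print()
print("two characters from different planes:")
// U+1D6A8 (plane 1) and U+CC791 (plane 12); the second one is unassigned and has no visible glyph
CharacterSet(charactersIn: "𝚨\u{CC791}").printHexValues()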
Output:

whitespaces:
09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000
two characters from different planes:
1D6A8
CC791
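The layout that the loop relies on can also be inspected directly. This is a small sketch, assuming the format described in the documentation linked in the code above (8192 bytes for the BMP, then one plane index byte plus 8192 bitmap bytes for each further plane that has members). All variable names here are made up for illustration.

```swift
import Foundation

let bmpOnly = CharacterSet(charactersIn: "a").bitmapRepresentation
let withPlane1 = CharacterSet(charactersIn: "a\u{1D6A8}").bitmapRepresentation

// U+0061 lives in byte 0x61 >> 3 = 12, at bit 0x61 & 7 = 1 (lowest bit first).
let codePoint = 0x61
let byte = bmpOnly[codePoint >> 3]
let mask: UInt8 = 1 << UInt8(codePoint & 7)
let aIsSet = byte & mask != 0

print("BMP-only size:", bmpOnly.count)
print("size with a plane 1 member:", withPlane1.count)
print("bit for U+0061 set:", aIsSet)

// Under the assumed layout, the byte right after the BMP bitmap is the plane index.
if withPlane1.count > 8192 {
    print("plane index byte:", withPlane1[8192])
}
```

If the documented layout holds, the two sizes should come out as 8192 and 8192 + 8193 = 16385, which is why codePoints() works modulo 8193.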
In practice this is 3 to 10 times faster than iterating over all characters; the comparison was made against the earlier answers to the question "NSArray from NSCharacterset".
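To tie this back to the random string generator from the question, here is one possible sketch. allCharacters() and randomString(length:from:) are hypothetical names, not Foundation API; the helper repeats the bitmap walk from codePoints() so that the snippet runs on its own.

```swift
import Foundation

extension CharacterSet {
    // Hypothetical helper: same bitmap walk as codePoints() above,
    // but collecting Characters instead of raw code points.
    func allCharacters() -> [Character] {
        var result: [Character] = []
        var plane = 0
        for (i, byte) in bitmapRepresentation.enumerated() {
            let k = i % 8193
            if k == 8192 {
                // plane index byte
                plane = Int(byte) << 13
                continue
            }
            let base = (plane + k) << 3
            for j in 0 ..< 8 where byte & (1 << j) != 0 {
                // The initializer is failable, so invalid values are skipped.
                if let scalar = Unicode.Scalar(UInt32(base + j)) {
                    result.append(Character(scalar))
                }
            }
        }
        return result
    }
}

// Hypothetical helper: draw `length` characters uniformly from `alphabet`.
func randomString(length: Int, from alphabet: CharacterSet) -> String {
    let pool = alphabet.allCharacters()
    guard !pool.isEmpty else { return "" }
    return String((0 ..< length).map { _ in pool.randomElement()! })
}

let sample = randomString(length: 12, from: CharacterSet(charactersIn: "abc123"))
print(sample)
```

The pool is rebuilt on every call here to keep the sketch short; if you generate many strings from the same set, you would probably want to compute it once and reuse it.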