
Is there any reasonable way to access the contents of a CharacterSet?

For a random string generator, I thought it would be nice to use CharacterSet as the input type for the alphabet, since the pre-defined sets such as CharacterSet.lowercaseLetters are obviously useful (even if they may contain a more diverse range of characters than you'd expect).

However, apparently you can only query character sets for membership, not enumerate them, let alone index them. All we get is _.bitmapRepresentation, an 8 KB chunk of data with an indicator bit for every (?) character. But even if you peel out individual bits by index i (which is less than nice, going through byte-oriented Data), Character(UnicodeScalar(i)) does not give the correct letter. So the format is somewhat obscure and, of course, not documented.

Of course we can iterate over all characters (per plane) but that is a bad idea, cost-wise: a 20-character set may require iterating over tens of thousands of characters. Speaking in CS terms: bit-vectors are a (very) bad implementation for sparse sets. Why they chose to make the trade-off in this way here, I have no idea.
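For reference, the brute-force approach described above could look roughly like the following sketch. `bruteForceScalars` is a made-up helper name, not part of Foundation; it relies only on `CharacterSet.contains(_:)` and the failable `Unicode.Scalar(_:)` initializer, and it always walks all 0x110000 candidate values no matter how small the set is.

```swift
import Foundation

extension CharacterSet {
    /// Hypothetical helper (not a Foundation API): tests every possible
    /// Unicode scalar value for membership and collects the hits.
    func bruteForceScalars() -> [Unicode.Scalar] {
        var result: [Unicode.Scalar] = []
        // 0x0 ... 0x10FFFF spans all 17 planes.
        for value in UInt32(0) ... 0x10FFFF {
            // Unicode.Scalar(_:) returns nil for surrogate code points.
            guard let scalar = Unicode.Scalar(value), contains(scalar) else { continue }
            result.append(scalar)
        }
        return result
    }
}

let vowels = CharacterSet(charactersIn: "aeiou")
print(vowels.bruteForceScalars().map { String($0) })
// prints ["a", "e", "i", "o", "u"]
```

The cost is the same for a five-member set as for a huge one, which is exactly the trade-off complained about here.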

Am I missing something here, or is CharacterSet just another dead end in the Foundation API?

Raphael asked Dec 11 '22 12:12


1 Answer

Following the documentation, here is an improvement on Satachito's answer that supports non-contiguous planes by actually taking the plane index into account:

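import Foundation

extension CharacterSet {
    /// Returns the code points of all members of the set.
    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        // Layout: 8192 bytes for plane 0, then, for each further plane present,
        // a 1-byte plane index followed by another 8192-byte bitmap.
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 8193
            if k == 8192 {
                // plane index byte, pre-shifted so that
                // (plane + k) << 3 == (planeIndex << 16) + k * 8
                plane = Int(w) << 13
                continue
            }
            // code point represented by bit 0 of this byte
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & (1 << j) != 0 {
                result.append(base + j)
            }
        }
        return result
    }

    /// Prints each code point as uppercase hex, padded to at least two digits.
    func printHexValues() {
        codePoints().forEach { print(String(format: "%02X", $0)) }
    }
}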

Usage

print("whitespaces:")
CharacterSet.whitespaces.printHexValues()
print()
print("two characters from different planes:")
CharacterSet(charactersIn: "𝚨󌞑").printHexValues()

Results

whitespaces:
09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000

two characters from different planes:
1D6A8
CC791

Performance

This is effectively 3 to 10 times faster than iterating over all characters; the comparison was made against the previous answers at "NSArray from NSCharacterSet".
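To tie this back to the random string generator from the question, here is a minimal sketch of how a list of code points might be turned into an alphabet and sampled. Both `characters(fromCodePoints:)` and `randomString(length:alphabet:)` are hypothetical helper names introduced for illustration. The input is a literal array so the sketch stands on its own; in practice it could be the output of `codePoints()`.

```swift
/// Hypothetical helper: turns code points into Characters, skipping values
/// that are not valid Unicode scalars (surrogates, out-of-range numbers).
func characters(fromCodePoints codePoints: [Int]) -> [Character] {
    return codePoints.compactMap { value -> Character? in
        guard let raw = UInt32(exactly: value),
              let scalar = Unicode.Scalar(raw) else { return nil }
        return Character(scalar)
    }
}

/// Hypothetical helper: draws `length` characters uniformly at random
/// (with replacement) from `alphabet`.
func randomString(length: Int, alphabet: [Character]) -> String {
    precondition(!alphabet.isEmpty, "alphabet must not be empty")
    return String((0 ..< length).map { _ in alphabet.randomElement()! })
}

// The input could also be CharacterSet.lowercaseLetters.codePoints();
// a literal array keeps this sketch self-contained.
let alphabet = characters(fromCodePoints: [0x61, 0x62, 0x63, 0xD800])
let sample = randomString(length: 8, alphabet: alphabet)
print(alphabet)  // ["a", "b", "c"] (0xD800 is a surrogate and is dropped)
print(sample)    // eight characters drawn from a, b, c
```

Because the alphabet is an array, picking a random element is O(1) after the one-time enumeration. If the set contains combining marks, adjacent picks may merge into a single grapheme, so the resulting String's count can be lower than `length`.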

Cœur answered May 15 '23 20:05