
Strange String.unicodeScalars and CharacterSet behaviour

Tags: swift

I'm trying to use a Swift 3 CharacterSet to filter characters out of a String, but I'm getting stuck very early on. A CharacterSet has a method called contains:

func contains(_ member: UnicodeScalar) -> Bool
Test for membership of a particular UnicodeScalar in the CharacterSet.

But testing this doesn't produce the expected behaviour.

let characterSet = CharacterSet.capitalizedLetters

let capitalAString = "A"

if let capitalA = capitalAString.unicodeScalars.first {
    print("Capital A is \(characterSet.contains(capitalA) ? "" : "not ")in the group of capital letters")
} else {
    print("Couldn't get the first element of capitalAString's unicode scalars")
}

I'm getting "Capital A is not in the group of capital letters", yet I'd expect the opposite.

Many thanks.

Josh Paradroid asked Feb 15 '17 14:02


1 Answer

CharacterSet.capitalizedLetters returns a character set containing the characters in Unicode General Category Lt, also known as "Letter, titlecase". These are "Ligatures containing uppercase followed by lowercase letters (e.g., ǅ, ǈ, ǋ, and ǲ)" (compare Wikipedia: Unicode character property or Unicode® Standard Annex #44 – Table 12. General_Category Values). A plain "A" belongs to General Category Lu ("Letter, uppercase"), not Lt, so it is not a member of this set.

You can find a list here: Unicode Characters in the 'Letter, Titlecase' Category.
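The difference between the two categories can also be checked per scalar. The following is a minimal sketch, assuming Swift 5 or later (where Unicode.Scalar exposes properties.generalCategory); the constant names are only illustrative and not part of the original question.

```swift
import Foundation

// "A" is in General Category Lu (Letter, uppercase).
let capitalA: Unicode.Scalar = "A"
// U+01C5 (ǅ) is one of the few scalars in General Category Lt (Letter, titlecase).
let titlecaseDz: Unicode.Scalar = "\u{01C5}"

// Inspect the General Category directly (assumes Swift 5+).
print(capitalA.properties.generalCategory == .uppercaseLetter)     // true
print(titlecaseDz.properties.generalCategory == .titlecaseLetter)  // true

// capitalizedLetters only covers Lt, so "A" is rejected and ǅ is accepted.
print(CharacterSet.capitalizedLetters.contains(capitalA))     // false
print(CharacterSet.capitalizedLetters.contains(titlecaseDz))  // true
```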

You can also use the code from "NSArray from NSCharacterSet" to dump the contents of the character set:

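extension CharacterSet {
    /// Returns every character whose Unicode scalar is a member of the set.
    func allCharacters() -> [Character] {
        var result: [Character] = []
        // Unicode has 17 planes (0...16); skip planes with no members.
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            // Each plane spans 0x10000 code points.
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                // UnicodeScalar(_:) returns nil for invalid values such as surrogates.
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}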

let characterSet = CharacterSet.capitalizedLetters
print(characterSet.allCharacters())

// ["ǅ", "ǈ", "ǋ", "ǲ", "ᾈ", "ᾉ", "ᾊ", "ᾋ", "ᾌ", "ᾍ", "ᾎ", "ᾏ", "ᾘ", "ᾙ", "ᾚ", "ᾛ", "ᾜ", "ᾝ", "ᾞ", "ᾟ", "ᾨ", "ᾩ", "ᾪ", "ᾫ", "ᾬ", "ᾭ", "ᾮ", "ᾯ", "ᾼ", "ῌ", "ῼ"]

What you probably want is CharacterSet.uppercaseLetters, which:

Returns a character set containing the characters in Unicode General Category Lu and Lt.
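Applied to the question's goal of filtering characters out of a String, this might look like the sketch below. The helper removingScalars(in:from:) is a hypothetical name introduced only for illustration, and it works scalar by scalar, so it is not aware of grapheme clusters.

```swift
import Foundation

// With uppercaseLetters (Lu and Lt), the membership test from the question succeeds.
let uppercase = CharacterSet.uppercaseLetters
let capitalA: Unicode.Scalar = "A"
print(uppercase.contains(capitalA)) // true

// Hypothetical helper: returns a copy of `string` without the scalars found in `set`.
func removingScalars(in set: CharacterSet, from string: String) -> String {
    var kept = String.UnicodeScalarView()
    for scalar in string.unicodeScalars where !set.contains(scalar) {
        kept.append(scalar)
    }
    return String(kept)
}

print(removingScalars(in: .uppercaseLetters, from: "Hello World")) // ello orld
```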

Martin R answered Oct 22 '22 19:10