Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?

Tags: string, swift, unicode, emoji

The character 👩‍👩‍👧‍👦 (family with two women, one girl, and one boy) is encoded as such:

U+1F469 WOMAN,
U+200D ZWJ,
U+1F469 WOMAN,
U+200D ZWJ,
U+1F467 GIRL,
U+200D ZWJ,
U+1F466 BOY

So it's very interestingly-encoded; the perfect target for a unit test. However, Swift doesn't seem to know how to treat it. Here's what I mean:

"👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦") // true
"👩‍👩‍👧‍👦".contains("👩") // false
"👩‍👩‍👧‍👦".contains("\u{200D}") // false
"👩‍👩‍👧‍👦".contains("👧") // false
"👩‍👩‍👧‍👦".contains("👦") // true

So, Swift says it contains itself (good) and a boy (good!). But it then says it does not contain a woman, girl, or zero-width joiner. What's happening here? Why does Swift know it contains a boy but not a woman or girl? I could understand if it treated it as a single character and only recognized it containing itself, but the fact that it got one subcomponent and no others baffles me.

This does not change if I use something like "👩".characters.first!.

Even more confounding is this:

let manual = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
Array(manual.characters) // ["👩‍", "👩‍", "👧‍", "👦"]

Even though I placed the ZWJs in there, they aren't reflected in the character array. What followed was a little telling:

manual.contains("👩") // false
manual.contains("👧") // false
manual.contains("👦") // true

So I get the same behavior with the character array... which is supremely annoying, since I know what the array looks like.

This also does not change if I use something like "👩".characters.first!.

777

asked Apr 25 '17 18:04

Ky.

3 Answers

This has to do with how the String type works in Swift, and how the contains(_:) method works.

"👩‍👩‍👧‍👦" is what's known as an emoji sequence, which is rendered as one visible character in a string. The sequence is made up of Character objects, and at the same time it is made up of UnicodeScalar objects.

If you check the character count of the string, you'll see that it is made up of four characters, while if you check the unicode scalar count, it will show you a different result:

print("👩‍👩‍👧‍👦".characters.count)     // 4
print("👩‍👩‍👧‍👦".unicodeScalars.count) // 7
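For comparison, the encoded views of the same string can be counted the same way. This is a minimal sketch, not part of the original answer: the constant name `family` is mine, and the string is written with scalar escapes so the invisible joiners cannot get lost when copying.

```swift
// The family emoji, spelled out scalar by scalar: woman, ZWJ, woman, ZWJ, girl, ZWJ, boy.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Seven Unicode scalars: four emoji plus three zero-width joiners.
print(family.unicodeScalars.count) // 7

// Each emoji needs a surrogate pair in UTF-16; each joiner is one code unit.
print(family.utf16.count)          // 11

// Each emoji is four bytes in UTF-8; each joiner is three bytes.
print(family.utf8.count)           // 25
```

These counts depend only on the encoding, so they should not vary with how a given Swift version groups scalars into characters.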

Now, if you parse through the characters and print them, you'll see what seem like normal characters, but in fact the first three characters contain both an emoji and a zero-width joiner in their UnicodeScalarView:

for char in "👩‍👩‍👧‍👦".characters {
    print(char)

    let scalars = String(char).unicodeScalars.map({ String($0.value, radix: 16) })
    print(scalars)
}

// 👩‍
// ["1f469", "200d"]
// 👩‍
// ["1f469", "200d"]
// 👧‍
// ["1f467", "200d"]
// 👦
// ["1f466"]

As you can see, only the last character does not contain a zero-width joiner, so when using the contains(_:) method, it works as you'd expect. Since you aren't comparing against emoji containing zero-width joiners, the method won't find a match for any but the last character.

To expand on this, if you create a String which is composed of an emoji character ending with a zero-width joiner, and pass it to the contains(_:) method, it will also evaluate to false. This has to do with contains(_:) being the exact same as range(of:) != nil, which tries to find an exact match to the given argument. Since characters ending with a zero-width joiner form an incomplete sequence, the method tries to find a match for the argument while combining characters ending with a zero-width joiner into a complete sequence. This means that the method won't ever find a match if:

the argument ends with a zero-width joiner, and
the string to parse doesn't contain an incomplete sequence (i.e. ending with a zero-width joiner and not followed by a compatible character).

To demonstrate:

let s = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}" // 👩‍👩‍👧‍👦

s.range(of: "\u{1f469}\u{200d}") != nil                            // false
s.range(of: "\u{1f469}\u{200d}\u{1f469}") != nil                   // false

However, since the comparison only looks ahead, you can find several other complete sequences within the string by working backwards:

s.range(of: "\u{1f466}") != nil                                    // true
s.range(of: "\u{1f467}\u{200d}\u{1f466}") != nil                   // true
s.range(of: "\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}") != nil  // true

// Same as the above:
s.contains("\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}")          // true

The easiest solution would be to provide a specific compare option to the range(of:options:range:locale:) method. The option String.CompareOptions.literal performs the comparison on an exact character-by-character equivalence. As a side note, what's meant by character here is not the Swift Character, but the UTF-16 representation of both the instance and comparison string – however, since String doesn't allow malformed UTF-16, this is essentially equivalent to comparing the Unicode scalar representation.

Here I've overloaded the Foundation method; if you still need the original behaviour, give this one a different name:

extension String {
    func contains(_ string: String) -> Bool {
        return self.range(of: string, options: String.CompareOptions.literal) != nil
    }
}

Now the method works as it "should" with each character, even with incomplete sequences:

s.contains("👩")          // true
s.contains("👩\u{200d}")  // true
s.contains("\u{200d}")    // true
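If shadowing Foundation's method is undesirable, the same literal comparison can live under its own name. The following is only a sketch under that assumption: `containsLiterally(_:)` is a hypothetical name of mine, not a Foundation API, and the only real call it relies on is range(of:options:) with the .literal option described above.

```swift
import Foundation

extension String {
    /// Hypothetical helper (not part of Foundation): literal search that does
    /// not merge zero-width joiners with their neighbours before comparing.
    func containsLiterally(_ other: String) -> Bool {
        return range(of: other, options: .literal) != nil
    }
}

// Woman, ZWJ, woman, ZWJ, girl, ZWJ, boy.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

let hasWoman = family.containsLiterally("\u{1F469}")  // woman
let hasJoiner = family.containsLiterally("\u{200D}")  // zero-width joiner
let hasMan = family.containsLiterally("\u{1F468}")    // man, which is not in the sequence

print(hasWoman, hasJoiner, hasMan)
```

With a literal comparison I would expect the first two checks to succeed, matching the results shown above, and the last to fail, since U+1F468 MAN never appears in the sequence.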

128

answered Oct 23 '22 19:10

xoudini

The first problem is you're bridging to Foundation with contains (Swift's String is not a Collection), so this is NSString behavior, which I don't believe handles composed Emoji as powerfully as Swift. That said, Swift I believe is implementing Unicode 8 right now, which also needed revision around this situation in Unicode 10 (so this may all change when they implement Unicode 10; I haven't dug into whether it will or not).

To simplify things, let's get rid of Foundation, and use Swift, which provides views that are more explicit. We'll start with characters:

"👩‍👩‍👧‍👦".characters.forEach { print($0) }
👩‍
👩‍
👧‍
👦

OK. That's what we expected. But it's a lie. Let's see what those characters really are.

"👩‍👩‍👧‍👦".characters.forEach { print(String($0).unicodeScalars.map{$0}) }
["\u{0001F469}", "\u{200D}"]
["\u{0001F469}", "\u{200D}"]
["\u{0001F467}", "\u{200D}"]
["\u{0001F466}"]

Ah… So it's ["👩ZWJ", "👩ZWJ", "👧ZWJ", "👦"]. That makes everything a bit more clear. 👩 is not a member of this list (it's "👩ZWJ"), but 👦 is a member.

The problem is that Character is a "grapheme cluster," which composes things together (like attaching the ZWJ). What you're really searching for is a Unicode scalar. And that works exactly as you're expecting:

"👩‍👩‍👧‍👦".unicodeScalars.contains("👩") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("\u{200D}") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("👧") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("👦") // true

And of course we can also look for the actual character that is in there:

"👩‍👩‍👧‍👦".characters.contains("👩\u{200D}") // true
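The scalar view's contains only looks for a single scalar. To look for a run of several scalars without Foundation, something like the following could work. It is a rough sketch: `containsScalars(of:)` is a name I made up, and it does a naive linear scan rather than anything clever.

```swift
extension String {
    /// Hypothetical helper: true if the Unicode scalars of `needle` appear
    /// contiguously, in order, among the Unicode scalars of `self`.
    func containsScalars(of needle: String) -> Bool {
        let haystack = Array(unicodeScalars)
        let pattern = Array(needle.unicodeScalars)
        guard !pattern.isEmpty else { return true }
        guard pattern.count <= haystack.count else { return false }
        // Slide a window of the pattern's length across the haystack.
        for start in 0...(haystack.count - pattern.count) {
            let window = haystack[start..<(start + pattern.count)]
            if window.elementsEqual(pattern) { return true }
        }
        return false
    }
}

// Woman, ZWJ, woman, ZWJ, girl, ZWJ, boy.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(family.containsScalars(of: "\u{1F469}"))                   // true
print(family.containsScalars(of: "\u{1F469}\u{200D}\u{1F469}"))  // true
print(family.containsScalars(of: "\u{1F467}\u{200D}\u{1F469}"))  // false: girl is followed by boy
print(family.containsScalars(of: "\u{1F468}"))                   // false: no man in the sequence
```

Because it compares scalars one by one, it never groups a joiner with its neighbour, which is the behaviour the question was hoping for.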

(This heavily duplicates Ben Leggiero's points. I posted this before noticing he'd answered. Leaving in case it is clearer to anyone.)

114

answered Oct 23 '22 20:10

Rob Napier

It seems that Swift considers a ZWJ to be an extended grapheme cluster with the character immediately preceding it. We can see this when mapping the array of characters to their unicodeScalars:

Array(manual.characters).map { $0.description.unicodeScalars }

This prints the following from LLDB:

▿ 4 elements
  ▿ 0 : StringUnicodeScalarView("👩‍")
    - 0 : "\u{0001F469}"
    - 1 : "\u{200D}"
  ▿ 1 : StringUnicodeScalarView("👩‍")
    - 0 : "\u{0001F469}"
    - 1 : "\u{200D}"
  ▿ 2 : StringUnicodeScalarView("👧‍")
    - 0 : "\u{0001F467}"
    - 1 : "\u{200D}"
  ▿ 3 : StringUnicodeScalarView("👦")
    - 0 : "\u{0001F466}"

Additionally, .contains groups extended grapheme clusters into a single character. For instance, taking the hangul characters ᄒ, ᅡ, and ᆫ (which combine to make the Korean word for "one": 한):

"\u{1112}\u{1161}\u{11AB}".contains("\u{1112}") // false

This could not find ᄒ because the three codepoints are grouped into one cluster which acts as one character. Similarly, \u{1F469}\u{200D} (WOMAN ZWJ) is one cluster, which acts as one character.
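The Hangul case can be checked at the scalar level too. A small sketch, with `decomposed` and `precomposed` as names of my own choosing:

```swift
// The three jamo written out as separate scalars.
let decomposed = "\u{1112}\u{1161}\u{11AB}"
// The single precomposed syllable U+D55C.
let precomposed = "\u{D55C}"

// Three scalars in storage...
print(decomposed.unicodeScalars.count)                 // 3

// ...and the scalar view can find the leading jamo on its own.
print(decomposed.unicodeScalars.contains("\u{1112}"))  // true

// String equality uses canonical equivalence, so both spellings compare equal.
print(decomposed == precomposed)                       // true
```

So the individual code points are still there; it is only the character-level comparison that treats the cluster as indivisible.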

answered Oct 23 '22 20:10

Ky.
