Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Case-insensitive comparison of sets in Python

I have two sets (although I can do lists, or whatever):

a = frozenset(('Today','I','am','fine'))
b = frozenset(('hello','how','are','you','today'))

I want to get:

frozenset(['Today'])

or at least:

frozenset(['today'])

The second option is doable if I lowercase everything I presume, but I'm looking for a more elegant way. Is it possible to do

a.intersection(b) 

in a case-insensitive manner?

Shortcuts in Django are also fine since I'm using that framework.

Example from intersection method below (I couldn't figure out how to get this formatted in a comment):

print intersection('Today I am fine tomorrow'.split(),
                    'Hello How a re you TODAY and today and Today and Tomorrow'.split(),
                    key=str.lower)

[(['tomorrow'], ['Tomorrow']), (['Today'], ['TODAY', 'today', 'Today'])]
like image 729
Adam Nelson Avatar asked Sep 25 '09 23:09

Adam Nelson


People also ask

How do you make a comparison case-insensitive in Python?

Using the casefold() method is the strongest and the most aggressive approach to string comparison in Python. It's similar to lower() , but it removes all case distinctions in strings. This is a more efficient way to make case-insensitive comparisons in Python.

Are Sets case-sensitive Python?

All names in Python are case-sensitive: variable names, function names, class names, module names, exception names. If you can get it, set it, call it, construct it, import it, or raise it, it's case-sensitive.

How do you do case-insensitive comparison?

Comparing strings in a case insensitive manner means to compare them without taking care of the uppercase and lowercase letters. To perform this operation the most preferred method is to use either toUpperCase() or toLowerCase() function. toUpperCase() function: The str.

How do I fix case sensitivity in Python?

To remove the case sensitivity, use the string. lower() or string. upper() function.


2 Answers

Here's version that works for any pair of iterables:

def intersection(iterableA, iterableB, key=lambda x: x):
    """Return the intersection of two iterables with respect to `key` function.

    """
    def unify(iterable):
        d = {}
        for item in iterable:
            d.setdefault(key(item), []).append(item)
        return d

    A, B = unify(iterableA), unify(iterableB)

    return [(A[k], B[k]) for k in A if k in B]

Example:

print intersection('Today I am fine'.split(),
                   'Hello How a re you TODAY'.split(),
                   key=str.lower)
# -> [(['Today'], ['TODAY'])]
like image 91
jfs Avatar answered Nov 02 '22 23:11

jfs


Unfortunately, even if you COULD "change on the fly" the comparison-related special methods of the sets' items (__lt__ and friends -- actually, only __eq__ needed the way sets are currently implemented, but that's an implementatio detail) -- and you can't, because they belong to a built-in type, str -- that wouldn't suffice, because __hash__ is also crucial and by the time you want to do your intersection it's already been applied, putting the sets' items in different hash buckets from where they'd need to end up to make intersection work the way you want (i.e., no guarantee that 'Today' and 'today' are in the same bucket).

So, for your purposes, you inevitably need to build new data structures -- if you consider it "inelegant" to have to do that at all, you're plain out of luck: built-in sets just don't carry around the HUGE baggage and overhead that would be needed to allow people to change comparison and hashing functions, which would bloat things by 10 times (or more) for the sae of a need felt in (maybe) one use case in a million.

If you have frequent needs connected with case-insensitive comparison, you should consider subclassing or wrapping str (overriding comparison and hashing) to provide a "case insensitive str" type cistr -- and then, of course, make sure than only instances of cistr are (e.g.) added to your sets (&c) of interest (either by subclassing set &c, or simply by paying care). To give an oversimplified example...:

class ci(str):
  def __hash__(self):
    return hash(self.lower())
  def __eq__(self, other):
    return self.lower() == other.lower()

class cifrozenset(frozenset):
  def __new__(cls, seq=()):
    return frozenset((ci(x) for x in seq))

a = cifrozenset(('Today','I','am','fine'))
b = cifrozenset(('hello','how','are','you','today'))

print a.intersection(b)

this does emit frozenset(['Today']), as per your expressed desire. Of course, in real life you'd probably want to do MUCH more overriding (for example...: the way I have things here, any operation on a cifrozenset returns a plain frozenset, losing the precious case independence special feature -- you'd probably want to ensure that a cifrozenset is returned each time instead, and, while quite feasible, that's NOT trivial).

like image 40
Alex Martelli Avatar answered Nov 02 '22 23:11

Alex Martelli