Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Set Intersection - Which objects are returned

Tags:

python

set

I have a question which is not really clear in python documentation (https://docs.python.org/2/library/stdtypes.html#set.intersection).

When using set.intersection the resulting set contains the objects from current set or from other? In the case that both objects have same value but are different objects in memory.

I am using this to compare a previous extraction taken from file with a new one coming from the internet. Both have some objects that are similar but I want to update the old ones. Or maybe there is a simpler alternative to achieve this? It would me much easier if sets implemented __getitem__.

    oldApsExtract = set()
    if (os.path.isfile("Apartments.json")):
        with open('Apartments.json', mode='r') as f:
            oldApsExtract = set(jsonpickle.decode(f.read()))
    newApsExtract = set(getNewExtract())

    updatedAps = oldApsExtract.intersection(newApsExtract)
    deletedAps = oldApsExtract.difference(newApsExtract)
    newAps = newApsExtract.difference(oldApsExtract)

    for ap in deletedAps:
        ap.mark_deleted()

    for ap in updatedAps:
        ap.update()

    saveAps = list(oldApsExtract) + list(newAps)
    with open('Apartments.json', mode='w') as f:
        f.write(jsonpickle.encode(saveAps))
like image 1000
husvar Avatar asked Dec 30 '15 23:12

husvar


People also ask

Which operator would you use to find the intersection of 2 sets in Python?

The intersection of two sets is the set of all the common elements of both the sets. You can use the intersection() method of the & operator to find the intersection of a Python set.

Which operator could you use to find the intersection of 2 sets?

Summary. The intersection of two or more sets returns elements that exist in all sets. Use the intersection() method or set intersection operator ( & ) to intersect two or more sets.

How do you find elements that are only in set A but not in B?

The operation which gives the elements that are only in set A but not in set B is A - B. Q. Given a non-empty set X, consider the binary operation *: P(X) × P(X) → P(X) given by A * B = A ∩ B &mnForE; A, B in P(X) is the power set of X.


1 Answers

What objects are used varies, if the sets are the same size the intersecting elements from b, if b has more elements then the objects from a are returned:

i = "$foobar" * 100
j = "$foob" * 100
l  = "$foobar" * 100
k = "$foob" * 100
print(id(i), id(j))
print(id(l), id(k))
a = {i, j}
b = {k, l, 3}
inter = a.intersection(b)
for ele in inter:
    print(id(ele))

Output:

35510304 35432016
35459968 35454928
35510304
35432016

Now when they are the same size:

i = "$foobar" * 100
j = "$foob" * 100
l  = "$foobar" * 100
k = "$foob" * 100
print(id(i), id(j))
print(id(l), id(k))
a = {i, j}
b = {k, l}
inter = a.intersection(b)
for ele in inter:
    print(id(ele))

Output:

35910288 35859984
35918160 35704816
35704816
35918160

There relevant part of the source. The line if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)), n the result of the comparison seems to decide which object to iterate over and what objects get used.

    if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
        tmp = (PyObject *)so;
        so = (PySetObject *)other;
        other = tmp;
    }
     
    while (set_next((PySetObject *)other, &pos, &entry)) {
        key = entry->key;
        hash = entry->hash;
        rv = set_contains_entry(so, key, hash);
        if (rv < 0) {
            Py_DECREF(result);
            return NULL;
        }
        if (rv) {
            if (set_add_entry(result, key, hash)) {
                Py_DECREF(result);
                return NULL;
            }

If you pass an object that is not a set, then the same is not true and the length is irrelevant as the objects from the iterable are used:

it = PyObject_GetIter(other);
if (it == NULL) {
    Py_DECREF(result);
    return NULL;
}

while ((key = PyIter_Next(it)) != NULL) {
    hash = PyObject_Hash(key);
    if (hash == -1)
        goto error;
    rv = set_contains_entry(so, key, hash);
    if (rv < 0)
        goto error;
    if (rv) {
        if (set_add_entry(result, key, hash))
            goto error;
    }
    Py_DECREF(key);

When you pass an iterable, firstly it could be an iterator so you could not check the size without consuming and if you passed a list then the lookup would be 0(n) so it makes sense to just iterate over the iterable passed in, in contrast if you have a set of 1000000 elements and one with 10, it makes sense to check if the 10 are in the set if 1000000 as opposed to checking if any of 1000000 are in your set of 10 as the lookup should be 0(1) on average so it mean a linear pass over 10 vs a linear pass over 1000000 elements.

If you look at the wiki.python.org/moin/TimeComplexity, this is backed up:

Average case -> Intersection s&t O(min(len(s), len(t))

Worst case -> O(len(s) * len(t))O(len(s) * len(t))

replace "min" with "max" if t is not a set

So when we pass an iterable we should always get the objects from b:

i = "$foobar" * 100
j = "$foob" * 100
l  = "$foobar" * 100
k = "$foob" * 100
print(id(i), id(j))
print(id(l), id(k))
a = {i, j}
b = [k, l, 1,2,3]
inter = a.intersection(b)
for ele in inter:
    print(id(ele))

You get the objects from b:

20854128 20882896
20941072 20728768
20941072
20728768

If you really want to decide what objects you keep then do the iterating and lookups yourself keeping whichever you want.

like image 195
Padraic Cunningham Avatar answered Oct 05 '22 23:10

Padraic Cunningham