Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"add to set" returns a boolean in java - what about python?

Tags:

java

python

set

In Java I like to use the Boolean value returned by an "add to the set" operation to test whether the element was already present in the set:

if (set.add("x")) {
   print "x was not yet in the set";
}

My question is, is there something as convenient in Python? I tried

 z = set()
 if (z.add(y)):
     print something

But it does not print anything. Am I missing something? Thx!

like image 553
seinecle Avatar asked Nov 28 '11 15:11

seinecle


4 Answers

In Python, the set.add() method does not return anything. You have to use the not in operator:

z = set()
if y not in z: # If the object is not in the list yet...
    print something
z.add(y)

If you really need to know whether the object was in the set before you added it, just store the boolean value:

z = set()
was_here = y not in z
z.add(y)
if was_here: # If the object was not in the list yet...
    print something

However, I think it is unlikely you need it.

This is a Python convention: when a method updates some object, it returns None. You can ignore this convention; also, there are methods "in the wild" that violate it. However, it is a common, recognized convention: I'd recommend to stick to it and have it in mind.

like image 89
brandizzi Avatar answered Oct 23 '22 00:10

brandizzi


As mentioned in the previous answers, the add method for Python sets does not return anything. By the way, this exact question was discussed on the Python mailing list: http://mail.python.org/pipermail/python-ideas/2009-February/002877.html.

like image 22
Vilas Avatar answered Oct 23 '22 01:10

Vilas


Generally, Python tries to avoid using conditions with side-effects. That is, the condition should be just a test, and operations that change data should happen on their own.

I agree that it's sometimes convenient to use a side-effect in a condition, but no, in this case, you need to:

z = set()
if y not in z:
    z.add(y)
    print something

Personally I like the simple expressiveness of if y not in z:, even if it takes up one more line of code, and it's one less thing to mentally parse when reading the the code at a later date.

like image 2
N3dst4 Avatar answered Oct 23 '22 02:10

N3dst4


Apparently, set().add(elem) always returns None and it has NoneType type, as per this:

$ python
Python 3.10.7 (main, Sep  7 2022, 01:54:01) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> type(set().add(12345))
<class 'NoneType'>

If you're wanting to avoid the double lookup, you might use len() to detect whether or not the element was added by checking the length of the set before and after the addition:

a_set = set()
#...
something=12345
pre_len = len(a_set)
a_set.add(something) #always returns None
if pre_len != len(a_set):
    print(f"The element({something}) was added therefore it was not already in the set.")
else:
    print(f"The element({something}) was not added because it was already in the set.")

I found this info about len():

How does len() work? len() works in O(1) time as the set is an object and has a member to store its size. Below is description of len() from Python docs.

Return the length (the number of items) of an object. The argument may be a sequence (such as a string, bytes, tuple, list, or range) or a collection (such as a dictionary, set, or frozen set).

source: https://www.geeksforgeeks.org/find-the-length-of-a-set-in-python/

I was expecting the length to be cached somehow and the above seems to confirm it.

Furthermore (thanks to the info and confirmation provided by the people on the #python channel on liberachat IRC) I've also confirmed it myself by looking at the source code, so now I know for sure that len() returns cached result.

There's also this page on the python wiki that confirms it: that the Operation Get Length(for a list, but affects all collections) is O(1) time complexity(even in worst case) and that the Operation x in s is between O(1) and O(n)(Worst Case) where 'n' is the number of elements currently in the container.

Below I tried to see the timing difference, but I don't know if I'm doing it right, and maybe I'm missing something but the len() variant seems a bit slower(it either takes 9.4% or 51.7% more time than the x in set variant). Here it is anyway:

what =========== time it took in seconds (underscores for readability/matching)
-----
init_set() highest time==========================5.374_412_298_202_514_648_438_
init_set() average time==========================5.374_412_298_202_514_648_438_
init_set() lowest time===========================5.374_412_298_202_514_648_438_
re_set() highest time============================1.283_961_534_500_122_070_312_
re_set() average time============================1.233_813_919_126_987_457_275_
re_set() lowest time=============================1.027_379_512_786_865_234_375_
use_len_and_always_add() highest time============0.000_089_168_548_583_984_375_
use_len_and_always_add() average time============0.000_002_022_699_288_719_765_*
use_len_and_always_add() lowest time=============0.000_001_668_930_053_710_938_
double_lookup_never_add() highest time===========0.000_107_288_360_595_703_125_
double_lookup_never_add() average time===========0.000_001_333_327_164_114_879_*
double_lookup_never_add() lowest time============0.000_000_953_674_316_406_250_
double_lookup_always_add() highest time==========0.000_087_261_199_951_171_875_
double_lookup_always_add() average time==========0.000_001_681_037_948_399_603_*
double_lookup_always_add() lowest time===========0.000_001_192_092_895_507_812_
double_lookup_never_add2() highest time==========0.000_120_401_382_446_289_062_
double_lookup_never_add2() average time==========0.000_001_423_196_238_256_303_*
double_lookup_never_add2() lowest time===========0.000_001_192_092_895_507_812_
using_len_many() highest time===================10.652_944_326_400_756_835_938_
using_len_many() average time===================10.642_948_746_681_213_378_906_
using_len_many() lowest time====================10.632_953_166_961_669_921_875_
double_lookup_always_add_many() highest time====10.278_126_478_195_190_429_688_
double_lookup_always_add_many() average time====10.234_028_577_804_565_429_688_
double_lookup_always_add_many() lowest time=====10.189_930_677_413_940_429_688_
double_lookup_always_add_many2() highest time===10.584_211_587_905_883_789_062_
double_lookup_always_add_many2() average time===10.565_821_886_062_622_070_312_
double_lookup_always_add_many2() lowest time====10.547_432_184_219_360_351_562_
double_lookup_many() highest time================9.843_203_306_198_120_117_188_
double_lookup_many() average time================9.723_606_467_247_009_277_344_
double_lookup_many() lowest time=================9.604_009_628_295_898_437_500_

The above is sorted manually(and the CPU was set to 800Mhz via sudo cpupower frequency-set --related --governor powersave --min 800MHz --max 800MHz) but the output is from this code:

#!/usr/bin/python3



import time as t
WORST="() highest time"
AVG="() average time"
BEST="() lowest time"

stats: dict[str,float]=dict({})

def TimeTakenDecorator(func):
    #src: https://stackoverflow.com/a/70954147/19999437
    def wraper(*args,**kwargs):
        global stats
        tmp_avg=stats.get(func.__name__+AVG)
        #too_fast:bool
        if (None == tmp_avg) or (tmp_avg > 0.5):
            print(f'Calling "{func.__name__}()"') #,end="")
            #too_fast=False
        #else:
            #too_fast=True
        start = t.time()
        func(*args,**kwargs)
        end = t.time()
        diff=end - start
        if None == stats.get(func.__name__+WORST):
            stats[func.__name__+WORST]=diff
        if None == tmp_avg:
            stats[func.__name__+AVG]=diff
            tmp_avg=diff
        if None == stats.get(func.__name__+BEST):
            stats[func.__name__+BEST]=diff
        if diff > stats[func.__name__+WORST]:
            stats[func.__name__+WORST]=diff
        if diff < stats[func.__name__+BEST]:
            stats[func.__name__+BEST]=diff
        stats[func.__name__+AVG]=(tmp_avg + diff) /2
        if diff > 0.5:
            print(f'Elapsed time for function call "{func.__name__}": {diff:.20f}')
            #print(" "+str(diff))
        #if not too_fast:
        #    print()
    return wraper

something=5_234_567
REPEATS=1_000_000

#init_set() highest time==========================5.374_412_298_202_514_648_438_
#init_set() average time==========================5.374_412_298_202_514_648_438_
#init_set() lowest time===========================5.374_412_298_202_514_648_438_
@TimeTakenDecorator
def init_set():
    #print("Initializing the set:")
    global g1_set
    g1_set = set()
    for i in range(10_000_000):
        g1_set.add(i)

#re_set() highest time============================1.283_961_534_500_122_070_312_
#re_set() average time============================1.233_813_919_126_987_457_275_
#re_set() lowest time=============================1.027_379_512_786_865_234_375_
@TimeTakenDecorator
def re_set():
    #print("Resetting the set:")
    global g2_set
    global g1_set
    g2_set=g1_set.copy()

#double_lookup_many() highest time================9.843_203_306_198_120_117_188_
#double_lookup_many() average time================9.723_606_467_247_009_277_344_
#double_lookup_many() lowest time=================9.604_009_628_295_898_437_500_
@TimeTakenDecorator
def double_lookup_many():
    #print("Using double lookup:")
    for i in range(REPEATS):
        double_lookup_never_add()

#double_lookup_always_add_many() highest time====10.278_126_478_195_190_429_688_
#double_lookup_always_add_many() average time====10.234_028_577_804_565_429_688_
#double_lookup_always_add_many() lowest time=====10.189_930_677_413_940_429_688_
@TimeTakenDecorator
def double_lookup_always_add_many():
    #print("Using double lookup:")
    for i in range(REPEATS):
        double_lookup_always_add()

#using_len_many() highest time===================10.652_944_326_400_756_835_938_
#using_len_many() average time===================10.642_948_746_681_213_378_906_
#using_len_many() lowest time====================10.632_953_166_961_669_921_875_
@TimeTakenDecorator
def using_len_many():
    #print("Using len():")
    for i in range(REPEATS):
        use_len_and_always_add()

#double_lookup_never_add() highest time===========0.000_107_288_360_595_703_125_
#double_lookup_never_add() average time===========0.000_001_333_327_164_114_879_
#double_lookup_never_add() lowest time============0.000_000_953_674_316_406_250_
@TimeTakenDecorator
def double_lookup_never_add():
    global g2_set
    if something not in g2_set:
        g2_set.add(something)
        #pass
        #print(f"The element({something}) was added therefore it was not already in the set.")
    #else:
        #g2_set.add(something)
        #pass
        #print(f"The element({something}) was not added because it was already in the set.")

#double_lookup_always_add() highest time==========0.000_087_261_199_951_171_875_
#double_lookup_always_add() average time==========0.000_001_681_037_948_399_603_
#double_lookup_always_add() lowest time===========0.000_001_192_092_895_507_812_
@TimeTakenDecorator
def double_lookup_always_add():
    global g2_set
    if something not in g2_set:
        g2_set.add(something)
    else:
        g2_set.add(something)

#use_len_and_always_add() highest time============0.000_089_168_548_583_984_375_
#use_len_and_always_add() average time============0.000_002_022_699_288_719_765_
#use_len_and_always_add() lowest time=============0.000_001_668_930_053_710_938_
@TimeTakenDecorator
def use_len_and_always_add():
    global g2_set
    pre_len = len(g2_set)
    g2_set.add(something)
    #pass
    if pre_len != len(g2_set):
        pass
        #print(f"The element({something}) was added therefore it was not already in the set.")
    #else:
    #    pass
        #print(f"The element({something}) was not added because it was already in the set.")

#double_lookup_never_add2() highest time==========0.000_120_401_382_446_289_062_
#double_lookup_never_add2() average time==========0.000_001_423_196_238_256_303_
#double_lookup_never_add2() lowest time===========0.000_001_192_092_895_507_812_
@TimeTakenDecorator
def double_lookup_never_add2():
    global g2_set
    if something not in g2_set:
        g2_set.add(something)

#double_lookup_always_add_many2() highest time===10.584_211_587_905_883_789_062_
#double_lookup_always_add_many2() average time===10.565_821_886_062_622_070_312_
#double_lookup_always_add_many2() lowest time====10.547_432_184_219_360_351_562_
@TimeTakenDecorator
def double_lookup_always_add_many2():
    global g2_set
    for i in range(REPEATS):
        g2_set.clear()
        double_lookup_never_add2()


def main():
    init_set()
    re_set()
    using_len_many()
    re_set()
    double_lookup_many()
    re_set()
    double_lookup_always_add_many()
    re_set()
    double_lookup_always_add_many2()
    print("Once more in reverse order:")
    re_set()
    double_lookup_always_add_many2()
    re_set()
    double_lookup_always_add_many()
    re_set()
    double_lookup_many()
    re_set()
    using_len_many()
    import json
    #from json import encoder
    #encoder.FLOAT_REPR = lambda o: f'{o:.20f}' #.format(o) #format(o, '.20f')
    #src: https://stackoverflow.com/a/69056325/19999437
    class RoundingFloat(float):
        __repr__ = staticmethod(lambda x: format(x, '.30f'))

    json.encoder.c_make_encoder = None
    #if hasattr(json.encoder, 'FLOAT_REPR'):
    #    # Python 2
    #    json.encoder.FLOAT_REPR = RoundingFloat.__repr__
    #else:
        # Python 3
    json.encoder.float = RoundingFloat
    print(json.dumps(stats, sort_keys=False, indent=2)) #, parse_float=lambda x: f"{x:.20f}"))
    import re
    for k in stats:
        time_form=re.sub(r'([0-9_]+\.)?([0-9]{3})', '\\1\\2_', f"{stats[k]:_.21f}")
        #print(f"{k:-<45} {stats[k]:.20f}")
        print(f"{k:=<45}={time_form:=>33}")


main();
like image 2
correabuscar Avatar answered Oct 23 '22 01:10

correabuscar