I have a DataFrame df1:
df1.head() =
               wght  num_links
id_y id_x
3    133   0.000203          2
     186   0.000203          2
5    6     0.000203          2
     98    0.000203          2
     184   0.000203          2
I need to calculate a variable called thr, thr = N*(N-1)*2, where N is the number of rows of df1.
The problem is that when I calculate thr, Python returns a negative value (although all of the inputs are positive):
ipdb> df1['wght'].count()*(df1['wght'].count()-1)*2
-712569744
Possible hint
The number of rows N is
ipdb> df1['wght'].count()
137736
therefore,
ipdb> 137736*137735*2
37942135920.
Taking into account that the max value that can be assigned to an int32 is 2147483647, I suspect that NumPy considers type(thr) = <int32>, when it should be <int64>. Does this make sense?
Please note that I have not written the code that generates df1
because
ipdb> df1['wght'].count()
137736
However, if it is needed to reproduce the error, let me know.
Thanks in advance.
You are experiencing np.int32 overflow, so just use len(df) instead of df.column.count().
Here is a small demo:
In [149]: x = pd.DataFrame(np.random.randint(0,100,size=(137736, 3)), columns=list('ABC'))
In [150]: x.A.count() * (x.A.count() - 1) * 2
Out[150]: -712569744
In [151]: len(x) * (len(x) - 1) * 2
Out[151]: 37942135920
In [153]: type(x.A.count())
Out[153]: numpy.int32
In [154]: type(len(x))
Out[154]: int
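The negative value in the demo is what signed 32-bit wraparound predicts. The sketch below reproduces it in plain Python, without NumPy; wrap_int32 is a hypothetical helper written for this illustration, not a pandas or NumPy function.

```python
def wrap_int32(value):
    """Reduce a Python int to what a signed 32-bit integer would hold."""
    low32 = value & 0xFFFFFFFF  # keep only the lowest 32 bits
    # In two's complement, values at or above 2**31 represent negative numbers.
    return low32 - 2**32 if low32 >= 2**31 else low32


n = 137736                # row count from the question
exact = n * (n - 1) * 2   # built-in Python ints never overflow
wrapped = wrap_int32(exact)

print(exact)    # 37942135920
print(wrapped)  # -712569744
```

The exact product needs more than 32 bits, so only its low 32 bits survive, and those bits read as a negative number.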
If you check the type of count() (i.e. type(df1['wght'].count())), you'll get:
<class 'numpy.int32'>
So try your computation with:
n = df1['wght'].count().astype(np.int64)
n*(n-1)*2
You can pass df1['wght'].count() to the long constructor like this, to ensure it is a long (Python 2):
N = long(df1['wght'].count())
Note that merely storing the count in a variable,
N = df1['wght'].count()
is not enough: N is still a numpy.int32, and NumPy's fixed-width arithmetic wraps around on overflow. Only the built-in int has a __mul__ method (which implements *) that creates a long result when required.
Python 3.x has "unified" int and long, so there the equivalent conversion is int(df1['wght'].count()).
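The conversion can also be wrapped in a small function so the multiplication always runs on built-in integers. This is a minimal sketch for Python 3; thr_from_count is a hypothetical name chosen for this example, and it relies only on int() accepting NumPy integer scalars as well as plain ints.

```python
def thr_from_count(n):
    """Compute N*(N-1)*2 using arbitrary-precision Python integers."""
    n = int(n)  # turns a NumPy integer scalar (or a plain int) into a built-in int
    return n * (n - 1) * 2


print(thr_from_count(137736))  # 37942135920
```

With a DataFrame at hand, the call would look like thr_from_count(df1['wght'].count()) or thr_from_count(len(df1)).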