I am doing a competition where I am provided data that is anonymized. Quite a few of the columns have HUGE values. The largest was 40 digits long! I used pd.read_csv, but those columns have been converted to objects as a result.
My original plan was to scale the data down, but since the values are seen as objects I can't do arithmetic on them.
Does anyone have a suggestion on how to handle huge numbers in Pandas or NumPy?
Note that I've tried converting the values to uint64 with no luck. I get the error "long too big to convert".
If you have a mixed-type column -- some integers, some strings -- stored in a dtype=object column, you can still convert to ints and perform arithmetic. Starting from a mixed-type column:
>>> df = pd.DataFrame({"A": [11**44, "11"*22]})
>>> df
A
0 6626407607736641103900260617069258125403649041
1 11111111111111111111111111111111111111111111
[2 rows x 1 columns]
>>> df.dtypes, list(map(type, df.A))
(A object
dtype: object, [<class 'int'>, <class 'str'>])
We can convert to ints:
>>> df["A"] = df["A"].apply(int)
>>> df.dtypes, list(map(type, df.A))
(A object
dtype: object, [<class 'int'>, <class 'int'>])
>>> df
A
0 6626407607736641103900260617069258125403649041
1 11111111111111111111111111111111111111111111
[2 rows x 1 columns]
And then perform arithmetic:
>>> df // 11
A
0 602400691612421918536387328824478011400331731
1 1010101010101010101010101010101010101010101
[2 rows x 1 columns]
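Since the original goal was to scale the data down, here is a minimal sketch of min-max scaling that works on arbitrary-size Python ints. The helper name min_max_scale is hypothetical, not something Pandas provides. It relies on the fact that Python computes int / int from the exact integers and rounds only once to a float.

```python
def min_max_scale(values):
    """Scale arbitrary-size ints to floats in the range [0.0, 1.0]."""
    values = [int(v) for v in values]  # also accepts numeric strings
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical; nothing to scale.
        return [0.0 for _ in values]
    # Subtraction is exact on Python ints; the division rounds once to float.
    return [(v - lo) / span for v in values]


big = [11**44, "11" * 22, 0]
scaled = min_max_scale(big)
print(scaled)
```

With a DataFrame, something like df["A"] = min_max_scale(df["A"].tolist()) should then give an ordinary float column.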
You can use the converters argument of pd.read_csv to call int, or some other custom converter function, on each string as it is imported:
import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,1,"Tiny"
4,-9999999999999999999999999999999999999999,"Really negative"
'''
# Parse Big_Num with int so each value becomes an arbitrary-precision Python int
df = pd.read_csv(StringIO(txt), converters={'Big_Num': int})
print(df)
Prints:
line Big_Num text
0 1 1234567890123456789012345678901234567890 That sure is a big number
1 2 9999999999999999999999999999999999999999 That is an even BIGGER number
2 3 1 Tiny
3 4 -9999999999999999999999999999999999999999 Really negative
Now test arithmetic:
n = df["Big_Num"][1]
print(n, n + 1)
Prints:
9999999999999999999999999999999999999999 10000000000000000000000000000000000000000
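The same idea extends from a single value to the whole column. The sketch below uses a shortened version of the CSV above (the text column is dropped for brevity) and assumes a recent Pandas on Python 3. It shows exact element-wise integer arithmetic on the object column, followed by a lossy conversion to float64 for cases where an ordinary numeric dtype is needed.

```python
import io

import pandas as pd

csv_text = """line,Big_Num
1,1234567890123456789012345678901234567890
2,9999999999999999999999999999999999999999
3,1
"""

df = pd.read_csv(io.StringIO(csv_text), converters={"Big_Num": int})

# Element-wise floor division on the object column stays exact,
# because each cell is an arbitrary-precision Python int.
halved = df["Big_Num"] // 2

# float64 can represent values of this magnitude, but only about
# 15 to 17 significant digits survive the conversion.
as_float = df["Big_Num"].astype(float)

print(halved)
print(as_float)
```

Keeping the column as Python ints preserves every digit but is slower than a native dtype. Converting to float64 is faster and works with NumPy routines, at the cost of precision.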
If any values in the column might cause int to raise a ValueError, you can do this:
txt='''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,0.000000000000000001,"Tiny"
4,"a string","Use 0 for strings"
'''
def conv(s):
    # Try int first so huge integers stay exact, then float, then fall back to 0
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return 0
df = pd.read_csv(StringIO(txt), converters={'Big_Num': conv})
print(df)
Prints:
line Big_Num text
0 1 1234567890123456789012345678901234567890 That sure is a big number
1 2 9999999999999999999999999999999999999999 That is an even BIGGER number
2 3 1e-18 Tiny
3 4 0 Use 0 for strings
Then every value in the column will be either a Python int or a float and will support arithmetic.
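If mixing ints and floats in one column is a problem, one alternative (not part of the answer above) is to parse everything with the standard library's decimal module, which keeps both the huge integers and the tiny fractions exact. This is only a sketch: the name to_decimal, its default parameter, and the precision of 60 digits are assumptions chosen for this data.

```python
from decimal import Decimal, InvalidOperation, getcontext

# Assumption: 60 significant digits is enough for this data. Decimal
# arithmetic rounds to the context precision (28 digits by default),
# even though parsing a string into a Decimal is always exact.
getcontext().prec = 60


def to_decimal(text, default=Decimal(0)):
    """Hypothetical converter: exact Decimal, or default for non-numeric text."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return default
    # Decimal also parses "NaN" and "Infinity"; treat them as non-numeric here.
    return value if value.is_finite() else default


big = to_decimal("1234567890123456789012345678901234567890")
tiny = to_decimal("0.000000000000000001")
fallback = to_decimal("a string")
total = big + tiny  # exact, because the result fits in 60 digits
print(total, fallback)
```

A function like this can be passed as converters={'Big_Num': to_decimal} in the same way as conv. The column would still be dtype=object, so arithmetic runs at Python speed.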