
Pandas changing values when inferring dtypes

Tags: python, pandas

I came across the following problem:

I have a file in JSON Lines format:

{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}

If I try to read it with Pandas like this:

import pandas as pd
df = pd.read_json('file.jsonl', orient='records', lines=True)

I get the following DataFrame:

   id                 uuid
0   1  1344800117571260416
1   2  1344900117571260928

But the uuid values have changed. I suspect some kind of overflow, but I am not sure: pandas infers int64 for that column, and np.iinfo(np.int64).max is 9223372036854775807, which is far larger than any value in the uuid column.

An immediate workaround is to disable dtype inference with pd.read_json(..., dtype=False), but I am curious about this unexpected behavior. Does anyone know why it happens?

BTW, I am using pandas 1.0.1 and Python 3.7.6.

Giovanni Rescia asked Apr 16 '21 22:04


1 Answer

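As pointed out in the comments, pandas converts the values with int(float(x)), which is the cause of the bug: the intermediate float64 cannot represent every integer above 2**53, so values this large are rounded to the nearest representable number. I filed a ticket to report the bug; you can check it out here.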
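The rounding can be reproduced without pandas. The following is a minimal sketch using only the standard library; it assumes the conversion behaves like a plain int(float(x)) round trip, as described in the comments. The helper name roundtrip_through_float and the constant FLOAT64_EXACT_INT_LIMIT are made up for illustration and are not part of pandas.

```python
# Hypothetical illustration: mimic a str -> float -> int conversion.
# float64 has a 53-bit significand, so above 2**53 not every integer
# can be represented exactly and values snap to the nearest float.
FLOAT64_EXACT_INT_LIMIT = 2 ** 53


def roundtrip_through_float(text):
    """Parse a digit string the lossy way: str -> float -> int."""
    return int(float(text))


# The two uuid values from the question, kept as strings.
originals = ["1344800117571260417", "1344900117571260918"]
converted = [roundtrip_through_float(s) for s in originals]

for original, result in zip(originals, converted):
    above_limit = int(original) > FLOAT64_EXACT_INT_LIMIT
    print(original, "->", result, "| above 2**53:", above_limit)
```

Both values lie between 2**60 and 2**61, where consecutive float64 values are 256 apart, so each uuid lands on the nearest multiple of 256. That matches the values shown in the DataFrame from the question.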

Giovanni Rescia answered Oct 10 '22 00:10