
How to get an exact representation of floats during `DataFrame.to_json`?

I observed the following behavior with DataFrame.to_json:

>>> df = pd.DataFrame([[eval(f'1.12345e-{i}') for i in range(8, 20)]])
>>> df
             0             1             2             3             4             5             6             7             8             9             10            11
0  1.123450e-08  1.123450e-09  1.123450e-10  1.123450e-11  1.123450e-12  1.123450e-13  1.123450e-14  1.123450e-15  1.123450e-16  1.123450e-17  1.123450e-18  1.123450e-19
>>> print(df.to_json(indent=2, orient='index'))
{
  "0":{
    "0":0.0000000112,
    "1":0.0000000011,
    "2":0.0000000001,
    "3":0.0,
    "4":0.0,
    "5":0.0,
    "6":0.0,
    "7":0.0,
    "8":1.12345e-16,
    "9":1.12345e-17,
    "10":1.12345e-18,
    "11":1.12345e-19
  }
}

So all numbers down to 1e-16 seem to be rounded to 10 decimal places (in agreement with the default value for double_precision) but all smaller values are represented exactly. Why is this the case and how can I turn off decimal rounding for the larger values too (i.e. using scientific notation instead)?


>>> pd.__version__
'1.3.1'

For reference, the standard library's json module doesn't do this:

>>> import json
>>> print(json.dumps([eval(f'1.12345e-{i}') for i in range(8, 20)], indent=2))
[
  1.12345e-08,
  1.12345e-09,
  1.12345e-10,
  1.12345e-11,
  1.12345e-12,
  1.12345e-13,
  1.12345e-14,
  1.12345e-15,
  1.12345e-16,
  1.12345e-17,
  1.12345e-18,
  1.12345e-19
]
a_guest asked Aug 24 '21 10:08



3 Answers

I'm not sure how to achieve this with pd.DataFrame.to_json itself, but combining pd.DataFrame.to_dict, the json module, and pd.read_json gives a full-precision JSON representation of a pandas DataFrame.

>>> import json
>>> json_df = json.dumps(df.to_dict(orient='index'), indent=2)
>>> print(json_df)
{
  "0": {
    "0": 1.12345e-08,
    "1": 1.12345e-09,
    "2": 1.12345e-10,
    "3": 1.12345e-11,
    "4": 1.12345e-12,
    "5": 1.12345e-13,
    "6": 1.12345e-14,
    "7": 1.12345e-15,
    "8": 1.12345e-16,
    "9": 1.12345e-17,
    "10": 1.12345e-18,
    "11": 1.12345e-19
  }
}

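To read it back in (on recent pandas versions, wrap the string in io.StringIO, since passing a literal JSON string to read_json is deprecated):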

>>> pd.read_json(json_df, orient='index')
             0             1             2   ...            9             10            11
0  1.123450e-08  1.123450e-09  1.123450e-10  ...  1.123450e-17  1.123450e-18  1.123450e-19

[1 rows x 12 columns]
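
The json route is lossless because json.dumps writes each float using Python's repr, which is the shortest string that parses back to the same float. Below is a minimal stdlib-only check of that round trip, using the same values as in the question. The variable names are only illustrative, and the check covers the json module alone, not pd.read_json.

```python
import json

# Same values as the single row of the question's DataFrame.
values = [float(f"1.12345e-{i}") for i in range(8, 20)]

# json.dumps writes each float via repr(); json.loads parses it back.
text = json.dumps(values)
restored = json.loads(text)

# The comparison is exact equality, not an approximate match.
print(restored == values)
```

Comparing with == rather than a tolerance is deliberate: the question asks for an exact representation.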
maneblusser answered Oct 21 '22 16:10


This behavior comes from /pandas/io/json/_json.py, where double_precision defaults to 10. Here is the relevant excerpt from the codebase:

def to_json(
    path_or_buf,
    obj,
    orient: Optional[str] = None,
    date_format: str = "epoch",
    double_precision: int = 10,
    force_ascii: bool = True,
    date_unit: str = "ms",
    default_handler: Optional[Callable[[Any], JSONSerializable]] = None,
    lines: bool = False,
    compression: Optional[str] = "infer",
    index: bool = True,
    indent: int = 0,

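With the maximum precision of 15 you get the output below. Note that it comes from a wider frame than the one in the question, with exponents running from -8 down to -39: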

>>> print(df.to_json(indent=2, orient='records', double_precision=15))
[
  {
    "0":0.0000000112345,
    "1":0.00000000112345,
    "2":0.000000000112345,
    "3":0.000000000011234,
    "4":0.000000000001123,
    "5":0.000000000000112,
    "6":0.000000000000011,
    "7":0.000000000000001,
    "8":1.12345e-16,
    "9":1.12345e-17,
    "10":1.12345e-18,
    "11":1.12345e-19,
    "12":1.12345e-20,
    "13":1.12345e-21,
    "14":1.12345e-22,
    "15":1.12345e-23,
    "16":1.12345e-24,
    "17":1.12345e-25,
    "18":1.12345e-26,
    "19":1.12345e-27,
    "20":1.12345e-28,
    "21":1.12345e-29,
    "22":1.12345e-30,
    "23":1.12345e-31,
    "24":1.12345e-32,
    "25":1.12345e-33,
    "26":1.12345e-34,
    "27":1.12345e-35,
    "28":1.12345e-36,
    "29":1.12345e-37,
    "30":1.12345e-38,
    "31":1.12345e-39
  }
]

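Note: you will get a ValueError if you request a precision above 15:

ValueError: Invalid value '20' for option 'double_precision', max is '15'

So in this sense to_json does not behave like json.dumps: even at the maximum precision, the fixed-point values are still rounded (for example, 1.12345e-11 is written as 0.000000000011234).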
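
The split visible in the outputs above (fixed-point with rounding down to about 1e-15, exponent notation below that) can be mimicked with a few lines of plain Python. This is only a sketch: the 1e-15 threshold is an assumption inferred from those outputs, the function name emulate_to_json_float is hypothetical, and the code does not claim to reproduce the encoder's rounding in every case or its handling of very large values.

```python
def emulate_to_json_float(value: float, double_precision: int = 10) -> str:
    """Hypothetical emulation of the float formatting seen in the outputs above."""
    magnitude = abs(value)
    # Assumption: non-zero magnitudes below 1e-15 are written in exponent
    # notation with up to `double_precision` significant digits.
    if magnitude != 0.0 and magnitude < 1e-15:
        return f"{value:.{double_precision}g}"
    # Otherwise: fixed-point, rounded to `double_precision` decimal places,
    # with trailing zeros removed but at least one decimal digit kept.
    text = f"{value:.{double_precision}f}".rstrip("0")
    return text + "0" if text.endswith(".") else text


for i in (8, 11, 15, 16):
    x = float(f"1.12345e-{i}")
    print(i, emulate_to_json_float(x))
```

Under this model, double_precision counts decimal places in the fixed-point branch but significant digits in the exponent branch, which would explain why only the larger values lose information.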

Karn Kumar answered Oct 21 '22 15:10


pd.DataFrame.to_json uses the internal library pandas._libs.json rather than the standard json module, which explains the difference in behavior. The former "normalizes" numbers internally and does not expose an API to turn this off. As a result, you have the following options:

Convert to a dictionary and dump it using the standard json library (as mentioned earlier):

>>> print(json.dumps(df.to_dict(orient='records'), indent=2))

[
  {
    "0": 1.12345e-08,
    "1": 1.12345e-09,
    "2": 1.12345e-10,
    "3": 1.12345e-11,
    "4": 1.12345e-12,
    "5": 1.12345e-13,
    "6": 1.12345e-14,
    "7": 1.12345e-15,
    "8": 1.12345e-16,
    "9": 1.12345e-17,
    "10": 1.12345e-18,
    "11": 1.12345e-19
  }
]

This is a perfectly legitimate solution.

Use CSV instead of JSON and specify the desired floating-point format:

>>> print(df.to_csv(float_format='%.10e', index=False))

0,1,2,3,4,5,6,7,8,9,10,11
1.1234500000e-08,1.1234500000e-09,1.1234500000e-10,1.1234500000e-11,1.1234500000e-12,1.1234500000e-13,1.1234500000e-14,1.1234500000e-15,1.1234500000e-16,1.1234500000e-17,1.1234500000e-18,1.1234500000e-19
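
A caveat on the format string: '%.10e' keeps 11 significant digits, which is enough for the values in this example but not for arbitrary float64 data. Seventeen significant digits ('%.16e') are enough to round-trip any float64. The stdlib-only sketch below shows the difference on an arbitrarily chosen test value.

```python
# A float64 whose shortest exact decimal form needs 17 significant digits.
x = 0.1 + 0.2

short = "%.10e" % x  # 11 significant digits
full = "%.16e" % x   # 17 significant digits

# Parsing the short form gives a different float; the full form round-trips.
print(short, float(short) == x)
print(full, float(full) == x)
```

If exactness matters for the CSV route, passing float_format='%.16e' to to_csv would be the safer choice, at the cost of longer output.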

Convert the values to strings before the "normalization" kicks in:

>>> print(df.astype(str).to_json(indent=2, orient='index'))

{
  "0":{
    "0":"1.12345e-08",
    "1":"1.12345e-09",
    "2":"1.12345e-10",
    "3":"1.12345e-11",
    "4":"1.12345e-12",
    "5":"1.12345e-13",
    "6":"1.12345e-14",
    "7":"1.12345e-15",
    "8":"1.12345e-16",
    "9":"1.12345e-17",
    "10":"1.12345e-18",
    "11":"1.12345e-19"
  }
}

Because the values are now JSON strings rather than numbers, they have to be converted back to floats explicitly when the JSON is read back.
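
As a rough illustration of that extra step, here is a stdlib-only sketch that parses a shortened copy of the string-valued JSON above and converts each cell back to a float. The variable names are illustrative. With pandas, calling astype(float) on the loaded frame would be the analogous step.

```python
import json

# A shortened copy of the string-valued JSON shown above.
text = '{"0": {"0": "1.12345e-08", "8": "1.12345e-16", "11": "1.12345e-19"}}'

raw = json.loads(text)

# Convert every cell from str back to float; row and column keys stay strings.
restored = {
    row: {col: float(cell) for col, cell in cells.items()}
    for row, cells in raw.items()
}

print(restored["0"]["8"] == 1.12345e-16)
```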

Finally, if you need the exact values, use a binary format such as parquet or pickle.

igrinis answered Oct 21 '22 16:10