import pandas as pd

idx_level_0 = pd.date_range('2020-01-01', '2020-04-01', freq = 'M')
idx_level_1 = pd.date_range('2020-04-01', '2020-07-01', freq = 'M')
idx_dates = pd.MultiIndex.from_product([idx_level_0, idx_level_1], names = ['Event_Date', 'Observation_Date'])
ser_info_dated = pd.Series(range(len(idx_level_0) * len(idx_level_1)), index = idx_dates, name = 'Some_Values') / 33
list_levels_dates = sorted(list(set(idx_level_0) | set(idx_level_1)))
dict_to_numbers = dict(zip(list_levels_dates, range(len(list_levels_dates))))
df_info_numbered = ser_info_dated.reset_index().replace({'Event_Date': dict_to_numbers, 'Observation_Date': dict_to_numbers})
df_info_downcasted = df_info_numbered.copy()
df_info_downcasted[['Event_Date', 'Observation_Date']] = df_info_downcasted[['Event_Date', 'Observation_Date']].astype('int16')
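For illustration, here is what the date-to-integer mapping built above produces: the sorted union of both month-end levels is enumerated, so each date gets a small ordinal code. (A minimal sketch; the six dates are written out literally so it does not depend on `date_range` frequency aliases.)

```python
import pandas as pd

# The six month-end dates that appear across both index levels above,
# written out literally to keep this snippet self-contained.
all_dates = pd.to_datetime([
    '2020-01-31', '2020-02-29', '2020-03-31',
    '2020-04-30', '2020-05-31', '2020-06-30',
])
# Same construction as dict_to_numbers above: enumerate the sorted dates.
to_numbers = dict(zip(sorted(all_dates), range(len(all_dates))))
print(to_numbers[pd.Timestamp('2020-01-31')])  # 0
print(to_numbers[pd.Timestamp('2020-06-30')])  # 5
```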
It seems to be a success:
print('df_info_downcasted column types:\n', df_info_downcasted.dtypes)
shows the following result:
df_info_downcasted column types:
Event_Date int16
Observation_Date int16
Some_Values float64
ser_info_downcasted = df_info_downcasted.set_index(['Event_Date', 'Observation_Date']).squeeze()
print('ser_info_downcasted index level 0 type: ', ser_info_downcasted.index.levels[0].dtype)
print('ser_info_downcasted index level 1 type: ', ser_info_downcasted.index.levels[1].dtype)
ser_info_downcasted index level 0 type: int64
ser_info_downcasted index level 1 type: int64
ser_info_astyped = ser_info_downcasted.copy()
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[0].astype('int16'), level = 0)
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[1].astype('int16'), level = 1)
print('ser_info_astyped index level 0 type: ', ser_info_astyped.index.levels[0].dtype)
print('ser_info_astyped index level 1 type: ', ser_info_astyped.index.levels[1].dtype)
ser_info_astyped index level 0 type: int64
ser_info_astyped index level 1 type: int64
TL;DR: Pandas converts integer indexes to 64-bit values; your best chance to minimize the file is compressed HDF serialization.
Pandas does not seem to support int16 dtype as an index.
Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set.
source
This is further reinforced in pandas.Index.astype.
Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size.
source
So essentially our int16 values get cast to int64 when set as an index. I have not worked with HDF5 before, but I tried to see what can be done to minimize the file size.
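A quick way to check this behaviour on your own installation (hedged: pandas 2.0 removed Int64Index and its index classes can preserve narrower integer dtypes, so the upcast described above applies to pandas 1.x):

```python
import numpy as np
import pandas as pd

# Build an index from int16 values and inspect the resulting dtype.
values = np.array([0, 1, 2], dtype='int16')
idx = pd.Index(values)
# pandas 1.x: int64 (Int64Index upcasts); pandas 2.0+: int16 is preserved.
print(idx.dtype)
```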
Looking at memory allocation
>>> print(ser_info_dated)
... Event_Date Observation_Date
... 2020-01-31 2020-04-30 0.000000
... 2020-05-31 0.030303
... 2020-06-30 0.060606
... 2020-02-29 2020-04-30 0.090909
... 2020-05-31 0.121212
... 2020-06-30 0.151515
... 2020-03-31 2020-04-30 0.181818
... 2020-05-31 0.212121
... 2020-06-30 0.242424
>>> print(ser_info_dated.memory_usage(index=True, deep=True))
... 478 # memory usage in bytes
vs
>>> print(df_info_downcasted)
... Event_Date Observation_Date Some_Values
... 0 0 3 0.000000
... 1 0 4 0.030303
... 2 0 5 0.060606
... 3 1 3 0.090909
... 4 1 4 0.121212
... 5 1 5 0.151515
... 6 2 3 0.181818
... 7 2 4 0.212121
... 8 2 5 0.242424
>>> print(df_info_downcasted.memory_usage(index=True, deep=True))
... Index 128
... Event_Date 18
... Observation_Date 18
... Some_Values 72
... dtype: int64
>>> print(df_info_downcasted.info())
... <class 'pandas.core.frame.DataFrame'>
... RangeIndex: 9 entries, 0 to 8
... Data columns (total 3 columns):
... # Column Non-Null Count Dtype
... --- ------ -------------- -----
... 0 Event_Date 9 non-null int16
... 1 Observation_Date 9 non-null int16
... 2 Some_Values 9 non-null float64
... dtypes: float64(1), int16(2)
... memory usage: 236.0 bytes
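As a sanity check, those per-column byte counts follow straight from the dtypes: each int16 column costs 9 rows x 2 bytes = 18 bytes, and the float64 column 9 x 8 = 72 bytes. A minimal sketch, independent of the dates above:

```python
import numpy as np
import pandas as pd

# Rebuild just the column layout: two int16 columns and one float64 column.
df = pd.DataFrame({
    'Event_Date': np.arange(9, dtype='int16'),
    'Observation_Date': np.arange(9, dtype='int16'),
    'Some_Values': np.zeros(9),  # float64 by default
})
usage = df.memory_usage(index=False)
print(usage['Event_Date'])   # 9 rows * 2 bytes = 18
print(usage['Some_Values'])  # 9 rows * 8 bytes = 72
```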
We can see that most of the memory is used in the index. When saving as HDF5 it does not seem to matter though. (For testing purposes I regenerated the data with a 12H frequency to get many more rows.)
>>> ser_info_dated.to_hdf("ser.h5", "ser")
>>> print(f"{os.path.getsize('ser.h5')/1000} kb")
... 412.628 kb
vs
>>> df_info_downcasted.to_hdf("down.h5", "down")
>>> print(f"{os.path.getsize('down.h5')/1000} kb")
... 413.204 kb
I do not know how you save your HDF5 files, but pandas' to_hdf has a compression argument, complevel, which can be useful.
>>> ser_info_dated.to_hdf("ser.h5", "ser", complevel=9)
>>> print(f"{os.path.getsize('ser.h5')/1000} kb")
... 155.452 kb
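To make the effect reproducible, here is a sketch comparing file sizes with and without complevel. It assumes the optional PyTables backend is installed and skips the comparison otherwise; the dummy data and exact sizes are not the ones from above.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

result = None  # stays None if the optional PyTables backend is missing
if importlib.util.find_spec('tables') is not None:
    ser = pd.Series(np.arange(50_000.0))  # dummy data with lots of redundancy
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, 'raw.h5')
        packed_path = os.path.join(tmp, 'packed.h5')
        ser.to_hdf(raw_path, key='ser')                  # no compression
        ser.to_hdf(packed_path, key='ser', complevel=9)  # max zlib compression
        result = os.path.getsize(packed_path) < os.path.getsize(raw_path)
    print(result)
```

There is also a complib argument (e.g. 'blosc') to pick the compression library, at the cost of portability to non-pandas HDF5 readers.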