I am trying to call df.set_index in such a way that the dtype of the column I call set_index on becomes the dtype of the new index. Unfortunately, in the following example, set_index changes the dtype.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': pd.Series(np.array([-1, 0, 1, 2], dtype=np.int8))})
df['ignore'] = df['a']
assert (df.dtypes == np.int8).all()  # fine
df2 = df.set_index('a')
assert df2.index.dtype == df['a'].dtype, df2.index.dtype  # fails: index is int64
Is it possible to avoid this behavior? My pandas version is 0.23.3.
Similarly,
new_idx = pd.Index(np.array([-1, 0, 1, 2]), dtype=np.dtype('int8'))
assert new_idx.dtype == np.dtype('int64')  # passes: the requested int8 was ignored
Even though the documentation for the dtype parameter says: "If an actual dtype is provided, we coerce to that dtype if it's safe. Otherwise, an error will be raised."
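For what it's worth, this limitation is specific to older pandas: before pandas 2.0, the numeric index classes were fixed at 64 bits, so any narrower dtype was silently upcast. If upgrading is an option, recent pandas keeps the requested dtype. A minimal check, assuming pandas 2.0 or later (on older versions both prints show int64):

```python
import numpy as np
import pandas as pd

# On pandas >= 2.0, Index supports arbitrary NumPy numeric dtypes,
# so the int8 array is kept as int8 instead of upcast to int64.
idx = pd.Index(np.array([-1, 0, 1, 2], dtype=np.int8))
print(idx.dtype)

# set_index likewise preserves the column's dtype on recent pandas.
df = pd.DataFrame({'a': np.array([-1, 0, 1, 2], dtype=np.int8), 'b': range(4)})
print(df.set_index('a').index.dtype)
```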
Despite my bloviating in the comments above, this might suffice to get an appropriate index that is both low-memory and starts from -1.
pandas.RangeIndex takes start and stop parameters, like range.
df = df.set_index(pd.RangeIndex(-1, len(df) - 1))
print(df.index, df.index.dtype, sep='\n')
This should be very memory efficient. Even though it is still of dtype int64 (which you probably want), it takes up very little memory.
pd.RangeIndex(-1, 4000000).memory_usage()
84
And the memory usage does not grow with the length of the range:
for i in range(1, 1000000, 100000):
    print(pd.RangeIndex(-1, i).memory_usage())
84
84
84
84
84
84
84
84
84
84
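The reason the figure never changes is that RangeIndex is lazy: it stores only its start, stop, and step and computes labels on demand, so its footprint is constant regardless of length. A quick comparison against a materialized integer index (the exact byte counts vary by pandas version):

```python
import numpy as np
import pandas as pd

n = 1_000_000

# RangeIndex: only start/stop/step are stored; labels are computed on the fly.
lazy = pd.RangeIndex(-1, n - 1)

# A regular integer Index materializes all n values (8 bytes each for int64).
materialized = pd.Index(np.arange(-1, n - 1))

print(lazy.memory_usage())          # a few dozen bytes, independent of n
print(materialized.memory_usage())  # roughly 8 * n bytes
```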