 

Can pd.DataFrame.set_index maintain dtype?

Tags: python, pandas

I am trying to call df.set_index so that the new index keeps the dtype of the column it was built from. Unfortunately, in the following example, set_index changes the dtype.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': pd.Series(np.array([-1, 0, 1, 2], dtype=np.int8))})
df['ignore'] = df['a']
assert (df.dtypes == np.int8).all()  # fine
df2 = df.set_index('a')
assert df2.index.dtype == df['a'].dtype, df2.index.dtype  # fails: the index dtype is not int8

Is it possible to avoid this behavior? My pandas version is 0.23.3.

Similarly, pd.Index ignores an explicitly requested dtype:

new_idx = pd.Index(np.array([-1, 0, 1, 2]), dtype=np.dtype('int8'))
assert new_idx.dtype == np.dtype('int64')  # passes: the requested int8 is not applied

This happens even though the documentation for the dtype parameter says: "If an actual dtype is provided, we coerce to that dtype if it's safe. Otherwise, an error will be raised."

Sam Shleifer asked Nov 07 '22 02:11



1 Answer

Despite my bloviating in the comments above, this might suffice to get an appropriate index that is both low memory and starts from -1.

pandas.RangeIndex

It takes start and stop parameters, like range.

df = df.set_index(pd.RangeIndex(-1, len(df) - 1))

print(df.index, df.index.dtype, sep='\n')
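As a rough, self-contained check of the snippet above, the sketch below rebuilds the frame from the question and attaches the same RangeIndex. The names idx, df_ranged and index_values are only illustrative. It assumes nothing beyond numpy and pandas being installed, and it only checks the index labels and the column dtypes, since the index dtype itself stays int64.

```python
import numpy as np
import pandas as pd

# Same frame as in the question: two int8 columns.
df = pd.DataFrame({'a': pd.Series(np.array([-1, 0, 1, 2], dtype=np.int8))})
df['ignore'] = df['a']

# RangeIndex(start, stop) excludes stop, just like range(start, stop).
idx = pd.RangeIndex(-1, len(df) - 1)

# Passing an Index object (not a column label) leaves the columns in place.
df_ranged = df.set_index(idx)

index_values = list(df_ranged.index)
print(index_values)
print(df_ranged.index.dtype)
print(df_ranged.dtypes)
```

Because the index is passed as an object rather than as a column name, column 'a' is not consumed and both columns keep their int8 dtype.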

This should be very memory efficient: although the index is still of dtype int64 (which you should want), it takes up very little memory.

pd.RangeIndex(-1, 4000000).memory_usage()

84

And the size stays the same as the range grows:

for i in range(1, 1000000, 100000):
    print(pd.RangeIndex(-1, i).memory_usage())

84
84
84
84
84
84
84
84
84
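
To put those numbers in context, here is a small sketch comparing a RangeIndex with an index that stores every label in an int64 array. The helper index_bytes and the names lazy, dense, small and large are hypothetical, made up for this comparison. The exact byte counts depend on the pandas and Python versions, so none are quoted here.

```python
import numpy as np
import pandas as pd

def index_bytes(stop):
    """Return (RangeIndex bytes, materialized int64 index bytes) for labels -1..stop-1."""
    # RangeIndex only stores start, stop and step.
    lazy = pd.RangeIndex(-1, stop).memory_usage()
    # A regular index built from an array stores one int64 per label.
    dense = pd.Index(np.arange(-1, stop, dtype=np.int64)).memory_usage()
    return lazy, dense

small = index_bytes(1_000)
large = index_bytes(1_000_000)

print("RangeIndex bytes:   ", small[0], large[0])
print("Materialized bytes: ", small[1], large[1])
```

The RangeIndex figure should be identical for both sizes, while the materialized index grows by about 8 bytes per label.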
84
piRSquared answered Nov 15 '22 12:11
