
Convert pandas DataFrame to record array without the extra column

I have the following code:

import numpy as np
import pandas as pd

test_array = np.array([(1, 2, 3), (4, 5, 6)], 
                      dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')})
test_df = pd.DataFrame.from_records(test_array)
test_df.to_records().view(np.float64).reshape(test_array.shape + (-1, ))

I expected this to return a view on the original test_array with shape (2, 3). Instead, I get this (2, 4) array:

rec.array([[0.e+000, 1.e+000, 2.e+000, 3.e+000],
           [5.e-324, 4.e+000, 5.e+000, 6.e+000]],
          dtype=float64)

Where did the extra column, column 0, come from?

Edit: I've just learned I can use DataFrame.values (an attribute, not a method) to do the same thing, but I'm still curious why this behavior exists.

mnosefish asked Mar 04 '23 23:03



2 Answers

If you need a record array, use np.rec.fromrecords:

np.rec.fromrecords(test_df, names=[*test_df])
# rec.array([(1., 2., 3.), (4., 5., 6.)],
#          dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

In my tests, this was somewhat faster than df.to_records.

cs95 answered Mar 06 '23 13:03



to_records includes the index as well, because index=True is the default. This is stated in the docs:

Index will be included as the first field of the record array if requested

If you want to exclude it, simply set index=False.
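As a rough sketch of what is going on, the snippet below compares the two calls on the DataFrame from the question. It assumes the default RangeIndex, stored as 64-bit integers; the names with_index, without_index, index_bits and flat are only illustrative. Viewing the record array as float64 reinterprets the raw bytes of the index field, which would explain why the index values 0 and 1 show up as 0.e+000 and 5.e-324.

```python
import numpy as np
import pandas as pd

test_array = np.array(
    [(1, 2, 3), (4, 5, 6)],
    dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')},
)
test_df = pd.DataFrame.from_records(test_array)

# Default (index=True): the index becomes an extra leading integer field.
with_index = test_df.to_records()
print(with_index.dtype.names)

# index=False: only the three float64 columns remain.
without_index = test_df.to_records(index=False)
print(without_index.dtype.names)

# Reinterpreting the integers 0 and 1 as float64 reproduces the odd column.
index_bits = np.array([0, 1], dtype=np.int64).view(np.float64)
print(index_bits)

# With the index gone, the float64 view reshapes to two rows of three values.
flat = np.asarray(without_index).view(np.float64).reshape(len(test_df), -1)
print(flat.shape)
```

The np.asarray call just drops the recarray subclass before taking the view; it is a convenience here, not something the answer requires.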


Although in your case you can simply use to_numpy (or values):

test_df.to_numpy().view(np.float64).reshape(test_array.shape + (-1, ))

array([[1., 2., 3.],
       [4., 5., 6.]])
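For a DataFrame whose columns are all float64, as here, to_numpy on its own should already give a plain 2-D float64 array, so the view and reshape are probably not needed. A minimal sketch, building an equivalent DataFrame directly (the variable arr is just illustrative):

```python
import numpy as np
import pandas as pd

# Same data as the question, built directly as three float64 columns.
test_df = pd.DataFrame({'a': [1.0, 4.0], 'b': [2.0, 5.0], 'c': [3.0, 6.0]})

# No index field is involved: to_numpy only returns the column data.
arr = test_df.to_numpy()
print(arr.shape, arr.dtype)
```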
yatu answered Mar 06 '23 11:03
