I want to create a pandas DataFrame with default values of zero, but with one column of integers and the other of floats. I am able to create a NumPy array with the correct types (see the values variable below). However, when I pass that into the DataFrame constructor, it only returns NaN values (see df below). I have included the untyped code that returns an array of floats (see df2).
import pandas as pd
import numpy as np

# Structured array: each element holds an int32 field and a float32 field
values = np.zeros((2,3), dtype='int32,float32')
index = ['x', 'y']
columns = ['a','b','c']
df = pd.DataFrame(data=values, index=index, columns=columns)
df.values.dtype

# Untyped version for comparison; np.zeros defaults to float64
values2 = np.zeros((2,3))
df2 = pd.DataFrame(data=values2, index=index, columns=columns)
df2.values.dtype
Any suggestions on how to construct the dataframe?
You can also create a NumPy array with specific dtypes and then convert it to a DataFrame. As an alternative, you can specify the dtype for each column by creating the Series objects first (a short sketch of that approach appears after the options below).
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.
Here are a few options you could choose from:
import numpy as np
import pandas as pd
index = ['x', 'y']
columns = ['a','b','c']
# Option 1: Set the column names in the structured array's dtype
dtype = [('a','int32'), ('b','float32'), ('c','float32')]
values = np.zeros(2, dtype=dtype)
df = pd.DataFrame(values, index=index)
# Option 2: Alter the structured array's column names after it has been created
values = np.zeros(2, dtype='int32, float32, float32')
values.dtype.names = columns
df2 = pd.DataFrame(values, index=index, columns=columns)
# Option 3: Alter the DataFrame's column names after it has been created
values = np.zeros(2, dtype='int32, float32, float32')
df3 = pd.DataFrame(values, index=index)
df3.columns = columns
# Option 4: Use a dict of arrays, each of the right dtype:
df4 = pd.DataFrame(
    {'a': np.zeros(2, dtype='int32'),
     'b': np.zeros(2, dtype='float32'),
     'c': np.zeros(2, dtype='float32')},
    index=index, columns=columns)
# Option 5: Concatenate DataFrames of the simple dtypes:
df5 = pd.concat([
    pd.DataFrame(np.zeros((2,), dtype='int32'), index=index, columns=['a']),
    pd.DataFrame(np.zeros((2,2), dtype='float32'), index=index, columns=['b','c'])],
    axis=1)
# Option 6: Alter the dtypes after the DataFrame has been formed. (This is not very efficient)
values2 = np.zeros((2, 3))
df6 = pd.DataFrame(values2, index=index, columns=columns)
for col, dtype in zip(df6.columns, 'int32 float32 float32'.split()):
    df6[col] = df6[col].astype(dtype)
Each of the options above produces the same result:
a b c
x 0 0 0
y 0 0 0
with dtypes:
a int32
b float32
c float32
dtype: object
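For completeness, the Series-per-column alternative mentioned near the top might look roughly like this (a minimal sketch, not part of the original options; each column is built as a typed Series and the DataFrame constructor aligns them on the shared index):
import numpy as np
import pandas as pd
index = ['x', 'y']
# Each column gets its dtype from its own Series
df7 = pd.DataFrame({
    'a': pd.Series(np.zeros(2), index=index, dtype='int32'),
    'b': pd.Series(np.zeros(2), index=index, dtype='float32'),
    'c': pd.Series(np.zeros(2), index=index, dtype='float32')})
df7.dtypes   # a int32, b float32, c float32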
Why pd.DataFrame(values, index=index, columns=columns) produces a DataFrame with NaNs: values is a structured array with column names f0, f1, f2:
In [171]: values
Out[171]:
array([(0, 0.0, 0.0), (0, 0.0, 0.0)],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<f4')])
If you pass the argument columns=['a', 'b', 'c'] to pd.DataFrame, then Pandas will look for columns with those names in the structured array values. When those columns are not found, Pandas places NaNs in the DataFrame to represent missing values.
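To see the field-name matching in action, compare passing the structured array's actual field names with passing new names (a quick illustrative check, assuming the same values array as above; df_ok and df_nan are just placeholder names):
import numpy as np
import pandas as pd
values = np.zeros(2, dtype='int32, float32, float32')
# Field names match ('f0', 'f1', 'f2'), so the zeros come through with their dtypes
df_ok = pd.DataFrame(values, index=['x', 'y'], columns=['f0', 'f1', 'f2'])
# No field is named 'a', 'b' or 'c', so every value is treated as missing and becomes NaN
df_nan = pd.DataFrame(values, index=['x', 'y'], columns=['a', 'b', 'c'])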