There is one thing that I find myself having to do quite often, and it surprises me how difficult it is to achieve in Pandas. Suppose I need to create an empty DataFrame with a specified index type and name, and specified column types and names. (I might want to fill it later, in a loop for example.) The easiest way I have found is to create an empty pandas.Series object for each column, specifying its dtype, put the Series into a dictionary that maps the column names to them, and pass the dictionary into the DataFrame constructor. Something like the following.
import pandas

def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    column_names = ["name", "score", "height", "weight"]
    # One empty Series per column carries that column's dtype.
    series = [pandas.Series(dtype=str), pandas.Series(dtype=int), pandas.Series(dtype=float), pandas.Series(dtype=float)]
    columns = dict(zip(column_names, series))
    # columns=column_names pins the column order; dictionaries on Python versions before 3.7 do not guarantee insertion order.
    return pandas.DataFrame(columns, index=index, columns=column_names)
First question. Is the above really the simplest way of doing this? There are so many things that are convoluted about this. What I really want to do, and what I'm pretty sure a lot of people really want to do, is something like the following.
df = pandas.DataFrame(columns=["id", "name", "score", "height", "weight"], dtypes=[int, str, int, float, float], index_column="id")
Second question. Is this sort of syntax at all possible in Pandas? If not, are the devs considering supporting something like this at all? It feels to me that it really ought to be as simple as this (the above syntax).
DataFrame.astype casts a pandas object to a specified dtype. Use a numpy.dtype or Python type to cast the entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type, to cast individual columns.
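As a rough sketch of how the {col: dtype} form of astype might be applied to the empty-frame problem in the question: build the frame from column names only, cast each column, then move id into the index. The variable name schema is only illustrative, and the dtype reported for the str column depends on the pandas version.

```python
import pandas as pd

# Column names and target dtypes, taken from the question's example.
# "schema" is just an illustrative variable name.
schema = {"id": int, "name": str, "score": int, "height": float, "weight": float}

# An empty frame built from column names alone starts with object columns,
# so cast each column with a {col: dtype} mapping, then promote "id" to the index.
df = pd.DataFrame(columns=list(schema)).astype(schema).set_index("id")

print(df.dtypes)
print(df.index.name)
```

This is one extra method call compared with the syntax wished for in the question, but it keeps the names and dtypes together in a single mapping.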
You can create an empty DataFrame by importing pandas and calling pd.DataFrame() with no arguments, which gives a frame with no rows and no columns.
You can create a new DataFrame with an additional column by using the DataFrame.assign() method. assign() adds new columns to a DataFrame and returns a new object (a copy) containing the original columns plus the new ones.
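A small illustration of that copy-returning behaviour of assign(), using made-up values and a hypothetical bmi column purely for demonstration:

```python
import pandas as pd

# Hypothetical sample values, only for illustration.
df = pd.DataFrame({"height": [1.80, 1.65], "weight": [72.0, 58.5]})

# assign() leaves df untouched and returns a new frame with the extra column.
with_bmi = df.assign(bmi=lambda d: d["weight"] / d["height"] ** 2)

print(list(df.columns))        # original columns only
print(list(with_bmi.columns))  # original columns plus "bmi"
```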
Unfortunately the DataFrame constructor accepts only a single dtype descriptor; however, you can cheat a little by using read_csv:
In [143]:
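import io
import pandas as pd

# Parse an empty buffer: names supplies the columns, dtype maps each one to its type.
cols = ["id", "name", "score", "height", "weight"]
dtypes = dict(zip(cols, [int, str, int, float, float]))
df = pd.read_csv(io.StringIO(""), names=cols, dtype=dtypes, index_col=["id"])
df.info()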
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 4 columns):
name      0 non-null object
score     0 non-null int32
height    0 non-null float64
weight    0 non-null float64
dtypes: float64(2), int32(1), object(1)
memory usage: 0.0+ bytes
So you can see that the dtypes are as desired and that the index is set as desired:
In [145]:
df.index
Out[145]:
Int64Index([], dtype='int64', name='id')
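If you would rather avoid the read_csv detour, the Series-per-column approach from the question can be wrapped in a small helper that gets close to the syntax asked for. This is only a sketch: empty_frame, schema, and index_col are hypothetical names, not part of the pandas API, and it assumes a Python version where dictionaries keep insertion order.

```python
import pandas as pd


def empty_frame(schema, index_col=None):
    """Build an empty DataFrame from a {column name: dtype} mapping.

    empty_frame, schema and index_col are hypothetical names for this sketch;
    they are not part of the pandas API.
    """
    # One empty Series per column carries that column's dtype.
    df = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in schema.items()})
    # Optionally promote one of the columns to the index.
    return df.set_index(index_col) if index_col is not None else df


df = empty_frame(
    {"id": int, "name": str, "score": int, "height": float, "weight": float},
    index_col="id",
)
print(df.dtypes)
print(df.index)
```

Taking a single mapping rather than parallel lists of names and dtypes keeps each column's name and type side by side, which is the main thing the constructor does not offer.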