How to set dtypes by column in pandas DataFrame

Tags: python, types, pandas

I want to bring some data into a pandas DataFrame and I want to assign dtypes for each column on import. I want to be able to do this for larger datasets with many different columns, but, as an example:

import numpy as np
import pandas as pd
myarray = np.random.randint(0, 5, size=(2, 2))
mydf = pd.DataFrame(myarray, columns=['a', 'b'], dtype=[float, int])
mydf.dtypes

results in:

TypeError: data type not understood

I tried a few other methods such as:

mydf = pd.DataFrame(myarray,columns=['a','b'], dtype={'a': int})

TypeError: object of type 'type' has no len()

If I put dtype=(float,int) it applies a float format to both columns.

In the end I would like to just be able to pass it a list of datatypes the same way I can pass it a list of column names.

asked Sep 01 '14 17:09

Chris

2 Answers

I just ran into this, and the pandas issue is still open, so I'm posting my workaround. Assuming df is my DataFrame and dtype is a dict mapping column names to types:

for k, v in dtype.items():
    df[k] = df[k].astype(v)

(Note: use dtype.iteritems() in Python 2.)

For reference:

The list of allowed data types (NumPy dtypes): https://docs.scipy.org/doc/numpy-1.12.0/reference/arrays.dtypes.html
Pandas also supports some other types. E.g., category: http://pandas.pydata.org/pandas-docs/stable/categorical.html
The relevant GitHub issue: https://github.com/pandas-dev/pandas/issues/9287
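
The loop above can also be collapsed into a single call, because DataFrame.astype accepts a dict mapping column names to dtypes. The following is a minimal sketch using the sample data from the question; the names columns and dtypes are illustrative only and not part of the original post. Zipping the two lists gives roughly the "list of dtypes" interface the question asks for:

```python
import numpy as np
import pandas as pd

# Sample data from the question: a 2x2 integer array.
myarray = np.random.randint(0, 5, size=(2, 2))

# Illustrative names (assumptions, not from the original post):
# parallel lists of column names and the dtype wanted for each column.
columns = ['a', 'b']
dtypes = [float, np.int64]

# Build the frame first, then cast every column in one astype call
# by zipping the two lists into a {column: dtype} dict.
mydf = pd.DataFrame(myarray, columns=columns).astype(dict(zip(columns, dtypes)))

print(mydf.dtypes)
```

Note that this still casts after construction, so it is a shorthand for the loop rather than a way to set dtypes inside the constructor itself.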

answered Sep 30 '22 22:09

mattexx

As of pandas version 0.24.2 (the current stable release at the time of writing), it is not possible to pass an explicit list of datatypes to the DataFrame constructor. As the docs state:

dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer

However, the DataFrame class does have a class method, from_records, that converts a NumPy structured array to a DataFrame, so you can do:

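>>> import numpy as np
>>> import pandas as pd
>>> myarray = np.random.randint(0, 5, size=(2, 2))
>>> # Each row must be a tuple; list() is needed because map() is lazy in Python 3
>>> record = np.array(list(map(tuple, myarray)), dtype=[('a', np.float64), ('b', np.int64)])
>>> mydf = pd.DataFrame.from_records(record)
>>> mydf.dtypes
a    float64
b      int64
dtype: object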
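
If the data comes from a file rather than an in-memory array, the reader functions accept per-column dtypes directly, which may be closer to assigning dtypes "on import". The following is a minimal sketch, assuming a CSV source; the CSV text is made up for illustration and does not come from the original question:

```python
import io

import pandas as pd

# Stand-in for a real file; the contents are invented for illustration.
csv_text = "a,b\n1,2\n3,4\n"

# read_csv accepts a dict mapping column names to dtypes, so each
# column is parsed straight into the requested type.
mydf = pd.read_csv(io.StringIO(csv_text), dtype={'a': 'float64', 'b': 'int64'})

print(mydf.dtypes)
```

This only applies when a reader such as read_csv is in play; for an existing NumPy array, the structured-array route shown above still applies.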

answered Sep 30 '22 23:09

user545424
