
pd.read_csv by default treats integers like floats


I have a csv that looks like (headers = first row):

name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01

When I run:

df = pd.read_csv('file.csv') 

Columns a and b have a .0 attached to the end like so:

df.head()

name,a,a1,b,b1
arnold,300311.0,arnld01,300311.0,arnld01
sam,300713.0,sam01,300713.0,sam01

Columns a and b contain integers or blanks, so why does pd.read_csv() treat them as floats, and how do I ensure they are read in as integers?

Asked by codingknob on Sep 23 '16



1 Answer

As root mentioned in the comments, this is a limitation of pandas (and NumPy): NaN is a float, and the empty values in your CSV are read in as NaN.

This is listed in the pandas gotchas documentation as well.
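To see the limitation in isolation, here is a minimal sketch (not part of the original answer): a single missing value is enough to force an otherwise all-integer column to float.

import pandas as pd

# An all-integer series gets an integer dtype...
pd.Series([300311, 300713]).dtype  # int64

# ...but one missing value forces the series to float64, because NaN
# is itself a float and the classic int dtypes have no way to
# represent a missing entry
pd.Series([300311, None]).dtype    # float64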

You can work around this in a few ways.

For the examples below I used the following to import the data. Note that I added a row with an empty value in columns a and b:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

df = pd.read_csv(StringIO(data), sep=",")

Drop NaN rows

Your first option is to drop the rows that contain a NaN value. The downside is that you lose those rows entirely. After getting your data into a dataframe, run this:

df.dropna(inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)

This drops all NaN rows from the dataframe, then converts columns a and b to int:

>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object

>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01

Fill NaN with placeholder data

This option replaces all your NaN values with a throwaway value that you need to choose; for this test I made it -999999. This lets us keep the rest of the data, convert the columns to int, and make it obvious which rows hold invalid data. You'll be able to filter these rows out if you are making calculations based on the columns later.

df.fillna(-999999, inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)

This produces a dataframe like so:

>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object

>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01
2    test -999999   test01 -999999   test01
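As a quick illustration of that later filtering step (a sketch; -999999 is just the placeholder chosen above):

# Keep only the rows whose a and b columns hold real values,
# dropping the placeholder rows
valid = df[(df.a != -999999) & (df.b != -999999)]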

Leave the float values

Finally, another option is to leave the values as floats (NaN included) and not worry about the non-integer data type.
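As a side note beyond the original answer: recent pandas versions (0.24+) also offer the nullable integer dtype 'Int64', which can hold missing values without falling back to float. A minimal sketch, assuming a recent pandas:

# 'Int64' (capital I) is pandas' nullable integer dtype; empty fields
# become missing values (pd.NA) instead of forcing the column to float
df = pd.read_csv(StringIO(data), dtype={"a": "Int64", "b": "Int64"})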

Answered by Andy on Oct 25 '22