
pd.read_csv by default treats integers like floats


I have a csv that looks like (headers = first row):

name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01

When I run:

df = pd.read_csv('file.csv') 

Columns a and b have a .0 attached to the end like so:

df.head()

name,a,a1,b,b1
arnold,300311.0,arnld01,300311.0,arnld01
sam,300713.0,sam01,300713.0,sam01

Columns a and b contain integers or blanks, so why does pd.read_csv() treat them as floats, and how do I ensure they are read in as integers?

Asked by codingknob on Sep 23 '16



1 Answer

As root mentioned in the comments, this is a limitation of pandas (and NumPy): NaN is a float, and the empty values in your CSV are read in as NaN.

This is listed in the pandas gotchas documentation as well.
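To see the limitation in isolation, here is a minimal sketch (not part of the original answer): a single missing value is enough to force an otherwise all-integer column to float.

import pandas as pd

# An all-integer series gets an integer dtype...
pd.Series([300311, 300713]).dtype  # int64

# ...but one missing value forces the series to float64, because NaN
# is itself a float and the classic int dtypes have no way to
# represent a missing entry
pd.Series([300311, None]).dtype    # float64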

You can work around this in a few ways.

For the examples below I used the following to import the data. Note that I added a row with an empty value in columns a and b:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

df = pd.read_csv(StringIO(data), sep=",")

Drop NaN rows

Your first option is to drop the rows that contain a NaN value. The downside is that you lose those rows entirely. After getting your data into a dataframe, run this:

df.dropna(inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)

This drops all NaN rows from the dataframe, then converts columns a and b to int:

>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object

>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01

Fill NaN with placeholder data

This option replaces all your NaN values with a throwaway value that you need to choose; for this test I made it -999999. This lets us keep the rest of the data, convert the columns to int, and make it obvious which rows hold invalid data. You'll be able to filter these rows out if you are making calculations based on the columns later.

df.fillna(-999999, inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)

This produces a dataframe like so:

>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object

>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01
2    test -999999   test01 -999999   test01
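As a quick illustration of that later filtering step (a sketch; -999999 is just the placeholder chosen above):

# Keep only the rows whose a and b columns hold real values,
# dropping the placeholder rows
valid = df[(df.a != -999999) & (df.b != -999999)]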

Leave the float values

Finally, another option is to leave the values as floats (NaN included) and not worry about the non-integer data type.
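As a side note beyond the original answer: recent pandas versions (0.24+) also offer the nullable integer dtype 'Int64', which can hold missing values without falling back to float. A minimal sketch, assuming a recent pandas:

# 'Int64' (capital I) is pandas' nullable integer dtype; empty fields
# become missing values (pd.NA) instead of forcing the column to float
df = pd.read_csv(StringIO(data), dtype={"a": "Int64", "b": "Int64"})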

Answered by Andy on Oct 25 '22