Specify correct dtypes to pandas.read_csv for datetimes and booleans

I am loading a csv file into a Pandas DataFrame. For each column, how do I specify what type of data it contains using the dtype argument?

  • I can do it with numeric data (code at bottom)...
  • But how do I specify time data...
  • and categorical data such as factors or booleans? I have tried np.bool_ and pd.tslib.Timestamp without luck.

Code:

import pandas as pd
import numpy as np
df = pd.read_csv(<file-name>, dtype={'A': np.int64, 'B': np.float64})
elgehelge asked Nov 20 '13 12:11


1 Answer

read_csv has many options that handle all the cases you mentioned. You might be tempted to try dtype={'A': datetime.datetime}, but the dtype argument is not the way to get parsed dates; use the date options below instead. Often you won't need dtype at all, as pandas can infer the types.

For dates, you need to specify the parse_dates options:

parse_dates : boolean, list of ints or names, list of lists, or dict
keep_date_col : boolean, default False
date_parser : function
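
As a minimal sketch of the date options above: the CSV text and the column name "when" are made up for illustration, and io.StringIO stands in for a real file path. It only uses parse_dates with a list of column names, alongside the numeric dtype mapping from the question.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file.
csv_text = (
    "A,B,when\n"
    "1,0.5,2013-11-20 12:11:00\n"
    "2,1.5,2013-11-21 08:30:00\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"A": "int64", "B": "float64"},  # numeric columns, as in the question
    parse_dates=["when"],                  # parse this column as datetimes
)

print(df.dtypes)
```

Newer pandas releases deprecate date_parser in favour of a date_format argument, so it is worth checking the documentation for the version you have installed.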

In general for converting boolean values you will need to specify:

true_values  : list  Values to consider as True
false_values : list  Values to consider as False
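
A small sketch of how true_values and false_values might be used, assuming a made-up column "active" that stores yes/no strings (the data and column names are illustrative, not from the question):

```python
import io

import pandas as pd

# Hypothetical CSV where booleans are written as "yes" / "no".
csv_text = "id,active\n1,yes\n2,no\n3,yes\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    true_values=["yes"],   # strings to read as True
    false_values=["no"],   # strings to read as False
)

print(df["active"].tolist())
```

The column only becomes boolean if every value in it appears in one of the two lists; otherwise pandas falls back to its usual inference for that column.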

With true_values and false_values, any value found in either list is converted to the corresponding boolean True/False. For more general conversions you will most likely need:

converters : dict, optional  Dict of functions for converting values in certain columns. Keys can either be integers or column labels
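
To illustrate converters, here is a hedged sketch with a hypothetical "price" column that carries a currency prefix. The helper parse_price is invented for this example; each converter receives the raw cell text as a string.

```python
import io

import pandas as pd

# Hypothetical CSV with a currency symbol in front of each number.
csv_text = "id,price\n1,$3.50\n2,$10.00\n"


def parse_price(text):
    """Strip a leading dollar sign and convert the rest to float."""
    return float(text.lstrip("$"))


df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"price": parse_price},  # key is the column label
)

print(df["price"].tolist())
```

Converters run as plain Python functions on every cell, so on large files they can be noticeably slower than the built-in dtype handling.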

Though dense, check here for the full list: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

Paul answered Oct 21 '22 10:10