
How to keep leading zeros in a column when reading CSV with Pandas?

I am importing study data into a Pandas data frame using read_csv.

My subject codes are six-digit numbers that encode, among other things, the date of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").

When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.

Is there a way to import this column unchanged, maybe as a string?

I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.

user1802883 asked Nov 06 '12 11:11





2 Answers

As indicated in this question/answer by Lev Landau, a simple solution is to use the converters option of the read_csv function for the column in question.

converters={'column_name': lambda x: str(x)} 

More options of the read_csv function are described in the pandas.io.parsers.read_csv documentation.

Let's say I have a CSV file projects.csv like the one below:

project_name,project_id
Some Project,000245
Another Project,000478

For example, the code below trims the leading zeros:

from pandas import read_csv

# Default parsing: project_id is inferred as an integer column
dataframe = read_csv('projects.csv')
print(dataframe)

Result:

me@ubuntu:~$ python test_dataframe.py
      project_name  project_id
0     Some Project         245
1  Another Project         478
me@ubuntu:~$

Solution code example:

from pandas import read_csv

# The converter receives each raw field as text and keeps it as a string
dataframe = read_csv('projects.csv', converters={'project_id': lambda x: str(x)})
print(dataframe)

Required result:

me@ubuntu:~$ python test_dataframe.py
      project_name project_id
0     Some Project     000245
1  Another Project     000478
me@ubuntu:~$
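For anyone who wants to try this without creating a file, here is a minimal sketch of the same comparison. It assumes the CSV content shown earlier, held in an in-memory string via io.StringIO; the names csv_text, default_df and converted_df are only illustrative. It also passes str directly as the converter, which should behave the same as the lambda.

```python
import io

import pandas as pd

# Illustrative in-memory copy of projects.csv, so no file is needed.
csv_text = "project_name,project_id\nSome Project,000245\nAnother Project,000478\n"

# Default parsing infers an integer column, so the leading zeros are lost.
default_df = pd.read_csv(io.StringIO(csv_text))

# With a converter, each raw field is passed through str and stays text.
converted_df = pd.read_csv(io.StringIO(csv_text), converters={"project_id": str})

print(default_df["project_id"].tolist())    # [245, 478]
print(converted_df["project_id"].tolist())  # ['000245', '000478']
```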

Update as it helps others:

To have all columns as str, one can do this (from the comment):

pd.read_csv('sample.csv', dtype = str) 
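One thing to keep in mind with dtype=str is that genuinely numeric columns come back as text too, so they need an explicit conversion afterwards. A small sketch, assuming a made-up sample with a hypothetical weight column next to the two code columns:

```python
import io

import pandas as pd

# Made-up sample data; the column names are only for illustration.
csv_text = "prefix,serial,weight\n007,000123,1.5\n042,004567,2.25\n"

# dtype=str applies to every column, so nothing is parsed as a number.
all_text_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(all_text_df["prefix"].tolist())  # ['007', '042']
print(all_text_df["weight"].tolist())  # ['1.5', '2.25']

# Convert the numeric column back explicitly when arithmetic is needed.
weights = all_text_df["weight"].astype(float)
print(weights.sum())  # 3.75
```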

To read only selected columns as str, one can do this:

# list of column names that need to be read as strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as dtype
pd.read_csv('sample.csv', dtype=dict_dtypes)
baltasvejas answered Sep 21 '22 18:09



Here is a shorter, robust and fully working solution:

Simply define a mapping (dictionary) between column names and the desired data types:

dtype_dic = {'subject_id': str,
             'subject_number': 'float'}

Use that mapping with pd.read_csv():

df = pd.read_csv(yourdata, dtype = dtype_dic) 

Et voilà!
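A self-contained sketch of this approach, assuming made-up data with the two columns from the mapping (the rows and the csv_text name are only for illustration): the subject code keeps its leading zero, while the other column is parsed as a float.

```python
import io

import pandas as pd

# Made-up rows; only the column names come from the mapping above.
csv_text = "subject_id,subject_number\n010816,1\n120455,2\n"

# Per-column dtypes: text for the code column, float for the numeric one.
dtype_dic = {"subject_id": str, "subject_number": "float"}

df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_dic)

print(df["subject_id"].tolist())      # ['010816', '120455']
print(df["subject_number"].tolist())  # [1.0, 2.0]
```

Columns that are not listed in the mapping are still inferred as usual, so only the columns that need special handling have to be named.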

ℕʘʘḆḽḘ answered Sep 23 '22 18:09
