Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does pandas remove leading zero when writing to a csv?

I have a dataframe that has a column called 'CBG' with numbers as a string value.

    CBG             acs_total_persons   acs_total_housing_units
0   010010211001    1925                1013
1   010030114011    2668                1303
2   010070100043    930                 532    

When I write it to a csv file, the leading 'O' are removed:

combine_acs_merge.to_csv(new_out_csv, sep=',')
>>> CBG: [0: 10010221101, ...]

It's already a string; how can I keep the leading zero from being removed in the .csv file

like image 849
gwydion93 Avatar asked Aug 09 '19 23:08

gwydion93


3 Answers

Lets take an example:

Below is your example DataFrame:

>>> df
    col1   num
0    One   011
1    two  0123
2  three  0122
3   four  0333

Considering the num as an int which you can convert to str().

>>> df["num"] = df["num"].astype(str)
>>> df.to_csv("datasheet.csv")

Output:

$ cat datasheet.csv

You will find the leading zeros are intacted..

,col1,num
0,One,011
1,two,0123
2,three,0122
3,four,0333

OR, if you reading the data from csv first then use belwo..

pd.read_csv('test.csv', dtype=str)

However, if your column CBG already str then it should be straight forward..

>>> df = pd.DataFrame({'CBG': ["010010211001", "010030114011", "010070100043"],
...                    'acs_total_persons': [1925, 2668, 930],
...                    'acs_total_housing_units': [1013, 1303, 532]})
>>>
>>> df
            CBG  acs_total_housing_units  acs_total_persons
0  010010211001                     1013               1925
1  010030114011                     1303               2668
2  010070100043                      532                930
>>> df.to_csv("CBG.csv")

result:

$ cat CBG.csv
,CBG,acs_total_housing_units,acs_total_persons
0,010010211001,1013,1925
1,010030114011,1303,2668
2,010070100043,532,930
like image 178
Karn Kumar Avatar answered Nov 13 '22 22:11

Karn Kumar


Pandas doesn't strip padded zeros. You're liking seeing this when opening in Excel. Open the csv in a text editor like notepad++ and you'll see they're still zero padded.

like image 7
manwithfewneeds Avatar answered Nov 13 '22 22:11

manwithfewneeds


When reading a CSV file pandas tries to convert values in every column to some data type as it sees fit. If it sees a column which contains only digits it will set the dtype of this column to int64. This converts "010010211001" to 10010211001.

If you don't want any data type conversions to happen specify dtype=str when reading in the CSV file. As per pandas documentation for read_csv https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html:

dtype : Type name or dict of column -> type, optional

    Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object
    together with suitable na_values settings to preserve and not interpret dtype. If
    converters are specified, they will be applied INSTEAD of dtype conversion.
like image 2
katsu Avatar answered Nov 13 '22 22:11

katsu