So I'm reading in a station codes csv file from NOAA which looks like this:
"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"
The first two columns contain codes for weather stations, and sometimes they have leading zeros. When pandas imports them without a specified dtype, they turn into integers. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i, since they are always six digits, but you know... that's the lazy man's way.
The csv is obtained using this code:
import urllib
file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()
which is all well and good but when I go and try and read it using this:
import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})
or
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})
I get a nasty error message:
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
return parser.read()
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
ret = self._engine.read(nrows)
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
data = self._reader.read(nrows)
File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood
It's a pretty big csv (31k rows) so maybe that has something to do with it?
The leading zeros disappear when calling read_csv() because the column is parsed as an int rather than as a string. The solution is to specify a string dtype for that column.
To add leading zeros to the strings in a pandas column, use the Series' str.zfill() method.
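As a rough sketch of that after-the-fact repair (the values below are made up for illustration, loosely based on the station codes in the question), padding with str.zfill() could look like this:

```python
import pandas as pd

# Hypothetical codes that already lost their leading zeros on import
codes = pd.Series([6852, 7005, 123456])

# Convert to strings first, then left-pad with zeros to a width of 6
padded = codes.astype(str).str.zfill(6)
print(padded.tolist())
```

This only works when the intended width is known in advance; reading the column as a string in the first place avoids that guesswork.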
In this case, the pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.
This is an issue of pandas dtype guessing.
Pandas sees numbers and guesses you want it to be numbers.
To make pandas not doubt your intentions, you should set the dtype you want: object
pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})
That will do the trick.
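To see the difference on the two sample rows from the question, here is a small sketch that reads them from an in-memory string instead of the downloaded file (the io.StringIO wrapper and the variable names are just for illustration):

```python
import io
import pandas as pd

# The header and the two sample rows quoted in the question
csv_text = '''"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"
'''

# Default behaviour: pandas guesses integers and the leading zeros are lost
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed["USAF"].tolist())

# With an explicit dtype the codes stay as text, zeros included
kept = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": object, "WBAN": object})
print(kept["USAF"].tolist())
```

The other columns are still type-guessed as usual; only the ones named in the dict are forced to stay as text.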
Update as it helps others:
To have all columns as str, one can do this (from the comment):
pd.read_csv('sample.csv', dtype = str)
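One thing to be aware of with dtype=str: as far as I know, empty fields are still read as missing values (NaN) by default. A small sketch on a cut-down version of the sample rows (the reduced column set and the variable names are only for illustration), with keep_default_na=False shown as one possible way to keep empty strings:

```python
import io
import pandas as pd

# Cut-down version of the sample data: two code columns, an empty one, a signed one
csv_text = (
    '"USAF","WBAN","STATE","LAT"\n'
    '"006852","99999","","+46817"\n'
    '"007005","99999","","-99999"\n'
)

# Every column is kept as text, so zeros and sign prefixes survive
all_str = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(all_str["USAF"].tolist())
print(all_str["LAT"].tolist())

# Empty fields are treated as missing by default, even with dtype=str
print(all_str["STATE"].isna().all())

# Optional: keep empty fields as empty strings instead of NaN
raw = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(raw["STATE"].tolist())
```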
To read only selected columns as str, one can do this:
# list of column names that need to be read as strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as the dtype argument
pd.read_csv('sample.csv', dtype=dict_dtypes)