Pandas read_csv dtype leading zeros

Tags: python, string, pandas, csv

So I'm reading in a station codes csv file from NOAA which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

The first two columns contain weather station codes, and some of them have leading zeros. When pandas imports them without a specified dtype, they turn into integers and the zeros are lost. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy man's way.
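For reference, the padding workaround could look roughly like the sketch below. It assumes a recent pandas on Python 3 and uses a two-row in-memory sample in place of the real file; the names `SAMPLE` and `padded` are made up for illustration.

```python
import io

import pandas as pd

# Two rows in the same shape as the NOAA file (illustrative sample only).
SAMPLE = '''"USAF","WBAN","STATION NAME"
"006852","99999","SENT"
"007005","99999","CWOS 07005"
'''

# With no dtype given, pandas infers integers and the leading zeros are lost.
df = pd.read_csv(io.StringIO(SAMPLE))

# Restore the zeros after the fact by padding back to six digits.
padded = df["USAF"].astype(str).str.zfill(6)
print(padded.tolist())
```

This only works because the USAF codes have a known fixed width; it cannot recover zeros for codes of varying length.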

The csv is obtained using this code:

import urllib

file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()

which is all well and good but when I go and try and read it using this:

import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

It's a pretty big csv (31k rows) so maybe that has something to do with it?

374

asked Jun 04 '13 23:06

Radical Edward

1 Answer

This is an issue of pandas dtype guessing.

Pandas sees values that look like numbers and guesses that you want numbers, so the leading zeros are dropped.

To keep pandas from second-guessing you, set the dtype you want explicitly: object

pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})

That will do the trick: the column is kept as the original strings, leading zeros included.
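Applied to the station file from the question, this might look like the sketch below. It assumes a recent pandas on Python 3 and reads a small in-memory sample (the hypothetical `SAMPLE` string) instead of the downloaded file.

```python
import io

import pandas as pd

# Two rows in the same shape as the NOAA file (illustrative sample only).
SAMPLE = '''"USAF","WBAN","STATION NAME"
"006852","99999","SENT"
"007005","99999","CWOS 07005"
'''

# Ask for object dtype on the two code columns so no numeric guessing happens.
df = pd.read_csv(io.StringIO(SAMPLE), dtype={"USAF": object, "WBAN": object})

print(df["USAF"].tolist())
print(df.dtypes)
```

For the real file, swap `io.StringIO(SAMPLE)` for the path to the downloaded CSV.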

Update, since it may help others:

To read all columns as str, you can do this (from the comments):

pd.read_csv('sample.csv', dtype=str)
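One thing to watch for with this approach: as far as I know, empty fields are still read as missing values (NaN) even when dtype is str, unless NA detection is turned off. A minimal sketch, assuming a recent pandas on Python 3 and a made-up in-memory sample:

```python
import io

import pandas as pd

# Illustrative sample; the second row has an empty CTRY field.
SAMPLE = '''"USAF","WBAN","CTRY","LAT"
"006852","99999","SW","+46817"
"007005","99999","","-99999"
'''

# Every column is read as text, so codes and signed values keep their form.
df_all = pd.read_csv(io.StringIO(SAMPLE), dtype=str)

# Empty fields still come back as NaN by default; keep_default_na=False
# leaves them as empty strings instead.
df_no_na = pd.read_csv(io.StringIO(SAMPLE), dtype=str, keep_default_na=False)

print(df_all["CTRY"].tolist())
print(df_no_na["CTRY"].tolist())
```

Whether NaN or an empty string is the better choice depends on what you do with the column afterwards.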

To read only selected columns as str, you can do this:

import pandas as pd

# list of column names that need to be read as strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as the dtype argument
pd.read_csv('sample.csv', dtype=dict_dtypes)
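With the station file from the question, the same pattern could be sketched as follows. This assumes a recent pandas on Python 3; the in-memory `SAMPLE` and the variable names are illustrative, not taken from the real file handling.

```python
import io

import pandas as pd

# Illustrative sample with two code columns and one numeric column.
SAMPLE = '''"USAF","WBAN","ELEV(.1M)"
"006852","99999","+14200"
"007005","99999","-99999"
'''

# Only the code columns are forced to str; everything else is still inferred.
str_cols = ["USAF", "WBAN"]
dtypes = {col: str for col in str_cols}

df = pd.read_csv(io.StringIO(SAMPLE), dtype=dtypes)

print(df["USAF"].tolist())
print(df.dtypes)
```

Columns left out of the dict, such as the elevation here, stay numeric, so they can still be used in calculations without a conversion step.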

176

answered Oct 15 '22 11:10

firelynx
