Pandas reading csv as string type

Tags:

python

type-conversion

casting

pandas

dtype

I have a data frame with alphanumeric keys that I want to save as a CSV and read back later. For various reasons I need to explicitly read this key column as strings: some keys are strictly numeric or, even worse, look like 1234E5, which pandas interprets as a float. This obviously makes the key completely useless.

The problem is that when I specify a string dtype for the data frame, or for any column of it, I just get garbage back. I have some example code here:

import numpy as np
import pandas as pd

# savefile is a path to the output CSV file
df = pd.DataFrame(np.random.rand(2, 2),
                  index=['1A', '1B'],
                  columns=['A', 'B'])
df.to_csv(savefile)

The data frame looks like:

           A         B
1A  0.209059  0.275554
1B  0.742666  0.721165

Then I read it like so:

df_read = pd.read_csv(savefile, dtype=str, index_col=0)

and the result is:

   A  B
B  (  <

Is this a problem with my computer, or something I'm doing wrong here, or just a bug?

768

asked Jun 07 '13 16:06

daver

2 Answers

Update: this has been fixed. From 0.11.1, passing str/np.str is equivalent to using object.

Use the object dtype:

In [11]: pd.read_csv('a', dtype=object, index_col=0)
Out[11]:
                      A                     B
1A  0.35633069074776547     0.745585398803751
1B  0.20037376323337375  0.013921830784260236
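If only the key column needs to stay textual, the same dtype can be applied to that column alone. The following is a minimal sketch and not part of the original answer: the column name key, the sample values, and the in-memory buffer used in place of a file are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical data: keys that look numeric ("1234E5", "007") as the index.
df = pd.DataFrame({"A": [0.25, 0.5], "B": [0.75, 1.0]},
                  index=pd.Index(["1234E5", "007"], name="key"))
csv_text = df.to_csv()  # no path given, so the CSV is returned as a string

# Default read: the type sniffer parses the key column as floats.
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing object dtype on the key column keeps the raw text. The index is
# set afterwards, so the dtype is applied before the column becomes the index.
fixed = pd.read_csv(io.StringIO(csv_text), dtype={"key": object}).set_index("key")

print(naive["key"].tolist())
print(fixed.index.tolist())
```

The A and B columns are still parsed as floats here, since the dtype mapping only names the key column.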

or better yet, just don't specify a dtype:

In [12]: pd.read_csv('a', index_col=0)
Out[12]:
           A         B
1A  0.356331  0.745585
1B  0.200374  0.013922

but bypassing the type sniffer and truly returning only strings requires a hacky use of converters:

In [13]: pd.read_csv('a', converters={i: str for i in range(100)})
Out[13]:
                      A                     B
1A  0.35633069074776547     0.745585398803751
1B  0.20037376323337375  0.013921830784260236

where 100 is some number equal to or greater than your total number of columns.

It's best to avoid the str dtype, see for example here.

102

answered Nov 02 '22 07:11

Andy Hayden

As Anton T said in his comment, pandas will sometimes turn object types into float types using its type sniffer, even if you pass dtype=object, dtype=str, or dtype=np.str.

Since you can pass a dictionary of functions where the key is a column index and the value is a converter function, you can do something like this (e.g. for 100 columns):

pd.read_csv('some_file.csv', converters={i: str for i in range(0, 100)})

You can even pass range(0, N) for N much larger than the number of columns if you don't know how many columns you will read.
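When the number of columns is unknown, an alternative to guessing N is to read just the header first and build the converters from the actual column names. This is a sketch under assumptions, not something from the answer above: the CSV content is made up, and it is read from an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "key,A,B\n1234E5,0.5,007\n1B,1.50,8\n"

# nrows=0 parses only the header row, which is enough to get the column names.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# One str converter per column, keyed by name instead of by position.
all_strings = pd.read_csv(io.StringIO(csv_text),
                          converters={name: str for name in columns})

print(all_strings)
```

Keying the converters by name avoids the oversized range, at the cost of opening the file twice.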

answered Nov 02 '22 08:11

Chris Conlan
