Mixed types when reading csv files. Causes, fixes and consequences

Tags: python, pandas, csv
What exactly happens when Pandas issues this warning? Should I worry about it?

In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139: 
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.              

  data = self._reader.read(nrows)

I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?

Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?

Finally, how exactly does low_memory=False fix the problem?

asked Aug 25 '14 by Amelio Vazquez-Reina


1 Answer

Revisiting mbatchkarov's link, low_memory is not deprecated. It is now documented:

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
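As a minimal sketch of the documentation's advice (the column name `value` and the sample data are made up for illustration), pre-specifying the `dtype` parameter forces a single type for the whole column and suppresses the warning:

```python
from io import StringIO

import pandas as pd

# A column that would otherwise get mixed inference: numbers plus a string.
csv_data = StringIO("id,value\n1,10\n2,x\n3,30\n")

# Force the ambiguous column to str, so every entry has one type.
df_typed = pd.read_csv(csv_data, dtype={"value": str})

print(df_typed["value"].tolist())  # ['10', 'x', '30'] -- all str
```

The trade-off is that you must know (or not mind) the target type up front; `str` is the safe choice, since it loses no information.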

I have asked what resulting in mixed type inference means, and chris-b1 answered:

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead bytes, so whether you get a mixed dtype warning or not can feel a bit random.

So, what type does Pandas end up using for those columns?

This is answered by the following self-contained example:

In [48]: import pandas as pd
         from io import StringIO

In [49]: df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

In [50]: type(df.loc[524287, '0'])
Out[50]: int

In [51]: type(df.loc[524288, '0'])
Out[51]: str

The first chunk of the csv data contained only integers, so it was converted to int; the second chunk also contained a string, so conversion failed and all of its entries were kept as str. The resulting column therefore holds both int and str objects, which is exactly the "mixed types" the warning refers to.

Can the type always be recovered after the fact? (after getting the warning)?

I guess re-exporting to csv and re-reading with low_memory=False should do the job.
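Alternatively, a sketch of an in-memory recovery (assuming the desired type is numeric): `pd.to_numeric` can convert the numeric entries of a mixed object column directly, without a round trip through csv:

```python
import pandas as pd

# A mixed column like the one read_csv produced above: ints and strings.
s = pd.Series([1, 2, "3", "a string"], dtype=object)

# Numeric-looking entries are converted; anything else becomes NaN.
recovered = pd.to_numeric(s, errors="coerce")

print(recovered.tolist())  # [1.0, 2.0, 3.0, nan]
```

With `errors="coerce"` the genuinely non-numeric entries are lost (turned into NaN), so this is only a full recovery when those entries were noise to begin with; otherwise keep the column as str.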

How exactly does low_memory=False fix the problem?

It reads all of the file before deciding the type, therefore needing more memory.
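To illustrate the whole-column inference that `low_memory=False` guarantees (a small made-up file, which would fit in one chunk either way, so this only demonstrates the resulting behavior, not the chunking difference):

```python
from io import StringIO

import pandas as pd

data = "col\n1\n2\na string\n"

# The whole column is type-inferred at once: since one value is not numeric,
# conversion to int fails for the column and every entry is kept as str.
df_whole = pd.read_csv(StringIO(data), low_memory=False)

print([type(v).__name__ for v in df_whole["col"]])  # ['str', 'str', 'str']
```

Every entry gets the same type, at the cost of holding the full file in memory before deciding.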

answered Oct 26 '22 by Robert Pollak