Pandas read_csv low_memory and dtype options

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?

asked Jun 16 '14 19:06

Josh

1 Answer

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

You get this low_memory warning because guessing dtypes for each column is very memory demanding. Pandas tries to determine which dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
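A small sketch of this inference, using a tiny in-memory CSV instead of a 10-million-row file. The column names and values are made up for illustration. When every value is numeric, pandas settles on an integer dtype; a single non-numeric value at the end leaves the whole column as strings.

```python
import io
import pandas as pd

# Hypothetical sample data: every user_id is numeric.
clean = io.StringIO("user_id,username\n1,Alice\n3,Bob\n")
clean_df = pd.read_csv(clean)
print(clean_df["user_id"].dtype)  # an integer dtype

# Same layout, but the last row has a non-numeric user_id.
dirty = io.StringIO("user_id,username\n1,Alice\n3,Bob\nfoobar,Caesar\n")
dirty_df = pd.read_csv(dirty)
print(dirty_df["user_id"].dtype)     # no longer an integer dtype
print(dirty_df["user_id"].tolist())  # the values are kept as strings
```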

Specifying dtypes (should always be done)

Adding

dtype={'user_id': int}

to the pd.read_csv() call tells pandas, as soon as it starts reading the file, that this column contains only integers.
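As a minimal sketch of such a call, assuming a small in-memory CSV with hypothetical values where every user_id really is an integer:

```python
import io
import pandas as pd

# Hypothetical well-formed data: user_id contains only integers.
csvdata = io.StringIO("user_id,username\n1,Alice\n3,Bob\n")

# The dtype mapping is keyed by column name; columns not listed are still inferred.
df = pd.read_csv(csvdata, dtype={"user_id": int})
print(df.dtypes)
```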

Also worth noting: if the last line in the file had "foobar" in the user_id column, loading would fail with the above dtype specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

Dtypes are typically a numpy concept; read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exist?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

'datetime64[ns, <tz>]' which is a time zone aware timestamp.

'category' which is essentially an enum (strings represented by integer keys to save memory).

'period[]' Not to be confused with a timedelta; these objects are actually anchored to specific time periods.

'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or "data that has a lot of holes in it". Instead of saving the NaN or None in the dataframe, it omits the objects, saving space.

'Interval' is a topic of its own but its main use is for indexing. See more here

'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas-specific integers that are nullable, unlike the numpy variants.

'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.

'boolean' is like the numpy 'bool' but it also supports missing data.
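A hedged sketch of passing a few of these pandas dtypes to read_csv by name. The CSV content and column names are invented for illustration, and the file deliberately contains missing values to show the nullable types at work.

```python
import io
import pandas as pd

# Hypothetical data with a missing user_id and a missing "active" flag.
csvdata = io.StringIO(
    "user_id,username,plan,active\n"
    "1,Alice,free,True\n"
    ",Bob,pro,\n"
    "3,Caesar,free,False\n"
)

df = pd.read_csv(
    csvdata,
    dtype={
        "user_id": "Int64",    # nullable integer: the empty field becomes <NA>
        "username": "string",  # dedicated string dtype
        "plan": "category",    # repeated labels stored as integer codes
        "active": "boolean",   # nullable boolean
    },
)
print(df.dtypes)
print(df["user_id"].isna().sum())  # number of missing ids
```

With a plain numpy int for user_id, the missing value in the second row could not be represented, which is the main reason to reach for 'Int64' here.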

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but it will not make the read more memory efficient; if anything, it only makes it more process efficient.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
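To illustrate the dtype=object point, here is a small sketch with made-up data. With dtype=object pandas skips type guessing and keeps every field as a Python string, so numeric-looking columns are not converted to compact numeric arrays.

```python
import io
import pandas as pd

# Hypothetical data mixing numeric-looking and non-numeric values.
csvdata = io.StringIO("user_id,score\n1,0.5\nfoobar,0.7\n")

df = pd.read_csv(csvdata, dtype=object)
print(df.dtypes)                   # every column is object
print(type(df.loc[0, "user_id"]))  # fields are kept as str, even "1"
print(df["score"].tolist())        # the scores are strings too, not floats
```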

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.
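For reference, a minimal sketch of the converters approach. The helper to_int_or_sentinel and the choice of -1 as a placeholder are assumptions made for this example, not something from the original comment. The function is called once per field in Python, which is where the overhead comes from.

```python
import io
import pandas as pd

def to_int_or_sentinel(value):
    """Hypothetical helper: parse an int, or return -1 for unparseable input."""
    try:
        return int(value)
    except ValueError:
        return -1

csvdata = io.StringIO("user_id,username\n1,Alice\n3,Bob\nfoobar,Caesar\n")

# converters maps a column name to a function applied to each raw string field.
df = pd.read_csv(csvdata, converters={"user_id": to_int_or_sentinel})
print(df["user_id"].tolist())
```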

CSV files can be processed line by line, so they could be handled more efficiently by cutting the file into segments and running multiple converter processes in parallel, something that pandas does not support. But this is a different story.

answered Oct 05 '22 13:10

firelynx

Tags:

python

pandas

dataframe

parsing

numpy