 

Get inferred dataframe types iteratively using chunksize

How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once?

I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks.

If the first chunk stored in the table contains only int and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the dataframe using read_csv and retain the highest inferred type. In addition, for object types, I need to retain the maximum length as these will be stored as strings in the table.
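To make the failure mode concrete, here is a minimal sketch of the schema mismatch (the file name 'demo.h5' and the key 'dset' are made up for illustration; HDFStore requires PyTables):

import pandas as pd

# The first chunk infers int64 for column 'x'; a later chunk infers float64.
first = pd.DataFrame({'x': [1, 2, 3]})        # x -> int64
later = pd.DataFrame({'x': [1.5, 2.5, 3.5]})  # x -> float64

with pd.HDFStore('demo.h5') as store:
    store.append('dset', first)
    store.append('dset', later)  # raises, because the table was created with int64 for 'x'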

Is there a pandas-idiomatic ("pandonic") way of retaining only this information without reading in the entire dataset?

asked Mar 21 '13 by Zelazny7



1 Answer

I didn't think it would be this easy, otherwise I wouldn't have posted the question. But once again, pandas makes things a breeze. I'm keeping the question up, as this information might be useful to others working with large data:

In [1]: import pandas as pd
   ...: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)

# Store the dtypes of each chunk into a list and convert it to a dataframe:

In [2]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])

In [3]: dtypes.values[:5]
Out[3]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)

# Very cool that I can take the max of these data types and it will preserve the hierarchy:

In [4]: dtypes.max().values
Out[4]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)

# I can now store the above into a dictionary:

types = dtypes.max().to_dict()

# And pass it into pd.read_csv for the second run:

chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500)
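
The answer above covers the dtype promotion but not the second requirement from the question: the maximum string length of each object column, which HDFStore needs as min_itemsize so that a long string in a later chunk still fits the table. Here is a minimal sketch of both passes, assuming the same CSV and the made-up names 'store.h5' and 'dset' (HDFStore requires PyTables):

import pandas as pd

# First pass: collect per-chunk dtypes and track the longest string in each object column.
dtypes = []
max_lens = {}
for chunk in pd.read_csv('DATASET.csv', chunksize=500, header=0):
    dtypes.append(chunk.dtypes)
    for col in chunk.columns[chunk.dtypes == object]:
        longest = int(chunk[col].astype(str).str.len().max())
        max_lens[col] = max(max_lens.get(col, 0), longest)

# Promote each column to the highest dtype seen across all chunks.
types = pd.DataFrame(dtypes).max().to_dict()

# Second pass: re-read with the promoted dtypes and append each chunk to the store,
# passing the per-column maximum lengths as min_itemsize.
with pd.HDFStore('store.h5') as store:
    for chunk in pd.read_csv('DATASET.csv', dtype=types, chunksize=500, header=0):
        store.append('dset', chunk, min_itemsize=max_lens)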
answered by Zelazny7