 

Get inferred dataframe types iteratively using chunksize

How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once?

I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks.

If the first chunk stored in the table contains only int and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the dataframe using read_csv and retain the highest inferred type. In addition, for object types, I need to retain the maximum length as these will be stored as strings in the table.
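To make the failure mode concrete, here is a minimal sketch of the schema mismatch (the file name 'demo.h5' and the key 'dset' are made up for illustration; HDFStore requires PyTables):

import pandas as pd

# The first chunk infers int64 for column 'x'; a later chunk infers float64.
first = pd.DataFrame({'x': [1, 2, 3]})        # x -> int64
later = pd.DataFrame({'x': [1.5, 2.5, 3.5]})  # x -> float64

with pd.HDFStore('demo.h5') as store:
    store.append('dset', first)
    store.append('dset', later)  # raises, because the table was created with int64 for 'x'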

Is there a pandas-idiomatic ("pandonic") way of retaining only this information without reading in the entire dataset?

asked Mar 21 '13 by Zelazny7



1 Answer

I didn't think it would be this easy, otherwise I wouldn't have posted the question. But once again, pandas makes things a breeze. I'm keeping the question up, as this information might be useful to others working with large data:

In [1]: import pandas as pd
   ...: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)

# Store the dtypes of each chunk into a list and convert it to a dataframe:

In [2]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])

In [3]: dtypes.values[:5]
Out[3]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)

# Very cool that I can take the max of these data types and it will preserve the hierarchy:

In [4]: dtypes.max().values
Out[4]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)

# I can now store the above into a dictionary:

types = dtypes.max().to_dict()

# And pass it into pd.read_csv for the second run:

chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500)
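
The answer above covers the dtype promotion but not the second requirement from the question: the maximum string length of each object column, which HDFStore needs as min_itemsize so that a long string in a later chunk still fits the table. Here is a minimal sketch of both passes, assuming the same CSV and the made-up names 'store.h5' and 'dset' (HDFStore requires PyTables):

import pandas as pd

# First pass: collect per-chunk dtypes and track the longest string in each object column.
dtypes = []
max_lens = {}
for chunk in pd.read_csv('DATASET.csv', chunksize=500, header=0):
    dtypes.append(chunk.dtypes)
    for col in chunk.columns[chunk.dtypes == object]:
        longest = int(chunk[col].astype(str).str.len().max())
        max_lens[col] = max(max_lens.get(col, 0), longest)

# Promote each column to the highest dtype seen across all chunks.
types = pd.DataFrame(dtypes).max().to_dict()

# Second pass: re-read with the promoted dtypes and append each chunk to the store,
# passing the per-column maximum lengths as min_itemsize.
with pd.HDFStore('store.h5') as store:
    for chunk in pd.read_csv('DATASET.csv', dtype=types, chunksize=500, header=0):
        store.append('dset', chunk, min_itemsize=max_lens)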
answered by Zelazny7