I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5 file.
My problem is that I have to aggregate the data into one format and then dump it into HDF5. This is ~1 TB of data, so it naturally cannot fit into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data to fit into one pandas dataframe, I would do this:
import pandas as pd
import csv

csv_columns = ["COL1", "COL2", "COL3", "COL4", ..., "COL55"]
readcsvfile = csv.reader(csvfile)

total_df = pd.DataFrame()  # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of field:value pairs, "dictionary_line"
    # save the dictionary as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])  # one line of tabular data
    total_df = pd.concat([total_df, df])           # grows one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da

csv_columns = ["COL1", "COL2", "COL3", "COL4", ..., "COL55"]  # define columns
readcsvfile = csv.reader(csvfile)  # read in file, if csv

# somehow define an empty dask dataframe? total_df = dd.DataFrame()?
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of field:value pairs, "dictionary_line"
    # save the dictionary as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])    # one line of tabular data
    total_df = da.concatenate([total_df, df])        # creates one big dataframe
After creating the ~1 TB dataframe, I will save it into HDF5.
My problem is that total_df does not fit into RAM and must be saved to disk. Can a dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile)
is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
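Instead of appending rows, a lazy dask.dataframe is usually built from lazy pieces. As a minimal sketch of that idea (parse_chunk, the chunk paths, and the columns below are hypothetical placeholders for your own aggregation logic):

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def parse_chunk(path):
    # hypothetical: your own parsing/aggregation for one chunk of input
    return pd.DataFrame({"COL1": [1], "COL2": ["a"]})

paths = ["chunk-000.tsv", "chunk-001.tsv"]   # hypothetical input chunks
parts = [parse_chunk(p) for p in paths]      # nothing is parsed yet (lazy)
total_df = dd.from_delayed(parts)            # one logical, larger-than-memory dataframe

Each delayed piece is only computed when the result is actually needed, so the full table never has to sit in RAM at once.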
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv
to see if it will still work. The dask.dataframe.read_csv
function supports these same arguments.
In particular, if the concern is that your data is separated by tabs rather than commas, this isn't an issue at all. Pandas supports a sep='\t'
keyword, along with a few dozen other options.
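For example, something along these lines may already cover the tab-delimited case (the glob pattern, column list, and blocksize are placeholders to adjust for your data):

import dask.dataframe as dd

csv_columns = ["COL1", "COL2", "COL3"]                # placeholder column names
ddf = dd.read_csv("mydata-*.tsv", sep="\t",
                  names=csv_columns, header=None,     # if the files carry no header row
                  blocksize="64MB")                   # bytes of text per partition; tune to your memory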
If you want to operate on text files line by line, then consider using dask.bag to parse your data, starting as a bag of raw text lines.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
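Here parse is whatever function turns one split line into a record; as a rough sketch, and assuming the placeholder field names below, it might look like this:

csv_columns = ["COL1", "COL2", "COL3"]            # placeholder field names

def parse(fields):
    # fields is the list produced by b.str.split('\t') for one line
    fields = [f.strip() for f in fields]          # drop the trailing newline
    return dict(zip(csv_columns, fields))         # one record per line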
Once you have a dask.dataframe, try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
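If a single ~1 TB HDF5 file is awkward to manage, to_hdf also accepts a globstring so that each partition is written to its own file (the file name here is illustrative):

df.to_hdf('myfile.*.hdf5', '/df')  # '*' is replaced by the partition number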