I regularly use dask.dataframe to read multiple files, like so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe when compute is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
You can do this by reading each CSV file into its own DataFrame, labelling it with the file it came from, and then appending or concatenating the DataFrames into a single DataFrame holding the data from all files. Here, I will use read_csv() to read the CSV files and concat() to combine the resulting DataFrames into one big DataFrame, as sketched below.
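For example (a minimal sketch; csv_files is an assumed list of the CSV file paths, and 'source' is just an illustrative column name):

import os
import pandas as pd

# read each file, tag its rows with the source file name, then combine
frames = []
for path in csv_files:  # csv_files is assumed to hold the CSV paths
    frame = pd.read_csv(path)
    frame['source'] = os.path.basename(path)  # label rows with their origin
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)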
The Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:

include_path_column : bool or str, optional
    Whether or not to include the path to each particular file. If True, a new column is added to the dataframe called path. If str, sets the new column name. Default is False.
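For example (a minimal sketch, using the question's wildcard pattern):

import dask.dataframe as dd

# adds a column named 'path' holding the file each row was read from
df = dd.read_csv('*.csv', include_path_column=True)

# or pick the column name yourself, e.g. 'partition' as in the question
df = dd.read_csv('*.csv', include_path_column='partition')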
Assuming you have, or can build, a file_list containing the file path of each csv file, and that each individual file fits in RAM (you mentioned 100 rows), then this should work:
import os
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def read_and_label_csv(filename):
    # reads one csv file into a pandas.DataFrame
    df_csv = pd.read_csv(filename)
    # label each row with its source file (basename strips the directory)
    df_csv['partition'] = os.path.basename(filename)
    return df_csv
# create a list of delayed tasks, one per file, each returning a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
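ddf is lazy, so the labelling only runs when the graph is executed, for example (a small usage sketch):

# count rows per source file; the 'partition' column is materialized on compute
counts = ddf.groupby('partition').size().compute()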
With some customization, of course. If your csv files are bigger than RAM, then a concatenation of dask.DataFrames is probably the way to go.
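For the bigger-than-RAM case, a minimal sketch (assuming the same file_list) could look like this: each file becomes its own lazy dask.DataFrame, gets labelled, and the pieces are concatenated:

import os
import dask.dataframe as dd

# one lazy dask.DataFrame per file, each labelled with its source file
ddfs = [
    dd.read_csv(fname).assign(partition=os.path.basename(fname))
    for fname in file_list
]

# concatenate the labelled pieces into a single dask.DataFrame
ddf = dd.concat(ddfs)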