How to create Dask DataFrame from a list of urls?

Question

I have a list of URLs, and I'd love to read them all into a dask dataframe at once, but it looks like read_csv can't use an asterisk (glob pattern) for http. Is there any way to achieve that?

Here is an example:

link = 'http://web.mta.info/developers/'

data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]

and what I want is

df = dd.read_csv('XXXX*X')

MRocklin · Accepted Answer

Try using dask.delayed to turn each of your URLs into a lazy pandas dataframe, and then use dask.dataframe.from_delayed to turn those lazy values into a full dask dataframe:

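import pandas as pd
import dask
import dask.dataframe as dd

# Build full URLs from the base link and relative paths in the question
urls = [link + path for path in data]

# One lazy pandas dataframe per URL
dfs = [dask.delayed(pd.read_csv)(url) for url in urls]

# Combine the lazy pieces into a single dask dataframe
df = dd.from_delayed(dfs)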

This will read one of your links immediately in order to figure out the metadata (columns, dtypes). If you know the columns and dtypes ahead of time, you can avoid this by passing an empty sample dataframe to dd.from_delayed(..., meta=sample_df).

See also: http://dask.pydata.org/en/latest/delayed-collections.html

Tags: python, pandas, dask

Philipp_Kats
